References

1 Introduction to Regression

2 Notation

2.1 Empirical/Sample Mean

  • empirical mean is defined as \[\bar X = \frac{1}{n}\sum_{i=1}^n X_i\]
  • centering the random variable is defined as \[\tilde X_i = X_i - \bar X\]
    • mean of \(\tilde X_i\) = 0

2.2 Empirical/Sample Standard Deviation & Variance

  • empirical variance is defined as \[ S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2 = \frac{1}{n-1} \left( \sum_{i=1}^n X_i^2 - n \bar X ^ 2 \right) \Leftarrow \mbox{shortcut for calculation}\]
  • empirical standard deviation is defined as \(S = \sqrt{S^2}\)
    • the variance is the average squared distance between the observations and the mean, so \(S\) measures a typical distance from the mean
    • has the same units as the data
  • scaling the random variables is defined as \(X_i / S\)
    • standard deviation of \(X_i / S\) = 1

2.3 Normalization

  • normalizing the data/random variable is defined as \[Z_i = \frac{X_i - \bar X}{S}\]
    • empirical mean = 0, empirical standard deviation = 1
    • distribution centered at 0 and normalized data have units equal to # of standard deviations away from the original mean
      • example: \(Z_1 = 2\) means that the data point is 2 standard deviations larger than the original mean
  • normalization makes non-comparable data comparable
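As a quick sketch of the definitions above (using synthetic data rather than any dataset from these notes), normalizing in R yields a variable with empirical mean 0 and empirical standard deviation 1:

```r
# synthetic data; any numeric vector would do
set.seed(1)
x <- rnorm(100, mean = 50, sd = 10)
# normalize: subtract the empirical mean, divide by the empirical sd
z <- (x - mean(x)) / sd(x)
mean(z)  # ~ 0 (up to floating point)
sd(z)    # 1
```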

2.4 Empirical Covariance & Correlation

  • Let \((X_i, Y_i)\) = pairs of data
  • empirical covariance is defined as \[ Cov(X, Y) = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X) (Y_i - \bar Y) = \frac{1}{n-1}\left( \sum_{i=1}^n X_i Y_i - n \bar X \bar Y\right) \]
    • has units of \(X \times\) units of \(Y\)
  • correlation is defined as \[Cor(X, Y) = \frac{Cov(X, Y)}{S_x S_y}\] where \(S_x\) and \(S_y\) are the estimates of standard deviations for the \(X\) observations and \(Y\) observations, respectively
    • the value is effectively the covariance standardized into a unit-less quantity
    • \(Cor(X, Y) = Cor(Y, X)\)
    • \(-1 \leq Cor(X, Y) \leq 1\)
    • \(Cor(X,Y) = 1\) and \(Cor(X, Y) = -1\) occur only when the \((X, Y)\) observations fall perfectly on a positively or negatively sloped line, respectively
    • \(Cor(X, Y)\) measures the strength of the linear relationship between the \(X\) and \(Y\) data, with stronger relationships as \(Cor(X,Y)\) heads towards -1 or 1
    • \(Cor(X, Y) = 0\) implies no linear relationship
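These correlation properties are easy to verify numerically (a sketch on simulated data):

```r
# synthetic data with a linear trend plus noise
set.seed(2)
x <- rnorm(50)
y <- x + rnorm(50)
cor(x, y) == cor(y, x)   # TRUE: correlation is symmetric
abs(cor(x, y)) <= 1      # TRUE: bounded between -1 and 1
cor(x, 2 * x + 3)        # 1: points fall on a positively sloped line
cor(x, -2 * x + 3)       # -1: points fall on a negatively sloped line
```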

3 Galton’s Data and Least Squares

# load necessary packages/install if needed
library(UsingR); data(galton)
library(ggplot2)
library(reshape)
long <- melt(galton)
g <- ggplot(long, aes(x = value, fill = variable))
g + geom_histogram(colour = "black", binwidth = 1) + facet_grid(. ~ variable)

# function to plot the histograms
myHist <- function(mu){
# calculate the mean squares
mse <- mean((galton$child - mu)^2)
# plot histogram
g <- ggplot(galton, aes(x = child)) + geom_histogram(fill = "salmon",
colour = "black", binwidth=1)
# add vertical line marking the center value mu
g <- g + geom_vline(xintercept = mu, size = 2)
g <- g + ggtitle(paste("mu = ", mu, ", MSE = ", round(mse, 2), sep = ""))
g
}
# manipulate allows the user to change the variable mu to see how the mean squares changes
# library(manipulate)
# manipulate(myHist(mu), mu = slider(62, 74, step = 0.5))
# mu that minimizes the sum is the empirical mean
myHist(round(mean(galton$child), 4))

library(dplyr)
# constructs table for different combination of parent-child height
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")
# convert to numeric values
freqData$child <- as.numeric(as.character(freqData$child))
freqData$parent <- as.numeric(as.character(freqData$parent))
# filter to only meaningful combinations
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(1, 10), guide = "none" )
# plot grey circles slightly larger than data as base (achieve an outline effect)
g <- g + geom_point(colour="grey50", aes(size = freq+10))
# plot the accurate data points
g <- g + geom_point(aes(colour=freq, size = freq))
# change the colour gradient from the default to lightblue -> white
g <- g + scale_colour_gradient(low = "lightblue", high="white")
g

3.1 Derivation for Least Squares = Empirical Mean (Finding the Minimum)

  • Let \(X_i =\) regressor/predictor, and \(Y_i =\) outcome/result so we want to minimize the squares: \[\sum_{i=1}^n (Y_i - \mu)^2\]
  • Proof is as follows \[ \begin{aligned} \sum_{i=1}^n (Y_i - \mu)^2 & = \sum_{i=1}^n (Y_i - \bar Y + \bar Y - \mu)^2 \Leftarrow \mbox{added} \pm \bar Y \mbox{which is adding 0 to the original equation}\\ (expanding~the~terms)& = \sum_{i=1}^n (Y_i - \bar Y)^2 + 2 \sum_{i=1}^n (Y_i - \bar Y) (\bar Y - \mu) + \sum_{i=1}^n (\bar Y - \mu)^2 \Leftarrow (Y_i - \bar Y), (\bar Y - \mu) \mbox{ are the terms}\\ (simplifying) & = \sum_{i=1}^n (Y_i - \bar Y)^2 + 2 (\bar Y - \mu) \sum_{i=1}^n (Y_i - \bar Y) +\sum_{i=1}^n (\bar Y - \mu)^2 \Leftarrow (\bar Y - \mu) \mbox{ does not depend on } i \\ (simplifying) & = \sum_{i=1}^n (Y_i - \bar Y)^2 + 2 (\bar Y - \mu) (\sum_{i=1}^n Y_i - n \bar Y) +\sum_{i=1}^n (\bar Y - \mu)^2 \Leftarrow \sum_{i=1}^n \bar Y \mbox{ is equivalent to }n \bar Y\\ (simplifying) & = \sum_{i=1}^n (Y_i - \bar Y)^2 + \sum_{i=1}^n (\bar Y - \mu)^2 \Leftarrow \sum_{i=1}^n Y_i - n \bar Y = 0 \mbox{ since } \sum_{i=1}^n Y_i = n \bar Y \\ \sum_{i=1}^n (Y_i - \mu)^2 & \geq \sum_{i=1}^n (Y_i - \bar Y)^2 \Leftarrow \sum_{i=1}^n (\bar Y - \mu)^2 \mbox{ is always} \geq 0 \mbox{ so we can take it out to form the inequality} \end{aligned} \]
    • because of the inequality above, to minimize the sum of the squares \(\sum_{i=1}^n (Y_i - \mu)^2\), \(\bar Y\) must be equal to \(\mu\)
  • An alternative approach to finding the minimum is taking the derivative with respect to \(\mu\) \[ \begin{aligned} \frac{d(\sum_{i=1}^n (Y_i - \mu)^2)}{d\mu} & = 0 \Leftarrow \mbox{setting this equal to 0 to find minimum}\\ -2\sum_{i=1}^n (Y_i - \mu) & = 0 \Leftarrow \mbox{divide by -2 on both sides and move } \mu \mbox{ term over to the right}\\ \sum_{i=1}^n Y_i & = \sum_{i=1}^n \mu = n\mu \Leftarrow \mu \mbox{ does not depend on } i\\ \mu & = \frac{\sum_{i=1}^n Y_i}{n} = \bar Y\\ \end{aligned} \]
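The result can also be sanity-checked numerically (a sketch with simulated heights; the grid search is purely illustrative):

```r
set.seed(3)
y <- rnorm(25, mean = 68, sd = 3)
# mean squared error as a function of the candidate center mu
mse <- function(mu) mean((y - mu)^2)
# grid search over candidate values of mu
candidates <- seq(60, 75, by = 0.01)
best <- candidates[which.min(sapply(candidates, mse))]
# the minimizer agrees with the empirical mean (up to the grid spacing)
c(best = best, ybar = mean(y))
```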

4 Regression through the Origin

# centering data
y <- galton$child - mean(galton$child)
x <- galton$parent - mean(galton$parent)
freqData <- as.data.frame(table(x, y))
# table(x, y) puts the parent values (x) in the first column
names(freqData) <- c("parent", "child", "freq")
freqData$child <- as.numeric(as.character(freqData$child))
freqData$parent <- as.numeric(as.character(freqData$parent))
myPlot <- function(beta){
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(1, 10), guide = "none" )
g <- g + geom_point(colour="grey50", aes(size = freq+10))
g <- g + geom_point(aes(colour=freq, size = freq))
g <- g + scale_colour_gradient(low = "lightblue", high="white")
g <- g + geom_abline(intercept = 0, slope = beta, size = 3)
mse <- mean( (y - beta * x) ^2 )
g <- g + ggtitle(paste("beta = ", beta, "mse = ", round(mse, 3)))
g
}
# manipulate allows the user to change the slope to see how the mean squares changes
# manipulate(myPlot(beta), beta = slider(0.6, 1.2, step = 0.02))
# slope that minimizes the sum is calculated as follows
beta <- coef(lm(I(child - mean(child)) ~ I(parent - mean(parent)) - 1, data = galton))[1]
myPlot(round(beta, 4))

4.1 Derivation for \(\beta\)

  • Let \(Y = \beta X\), and \(\hat \beta\) = estimate of \(\beta\), the slope of the least square regression line \[ \begin{aligned} \sum_{i=1}^n (Y_i - X_i \beta)^2 & = \sum_{i=1}^n \left[ (Y_i - X_i \hat \beta) + (X_i \hat \beta - X_i \beta) \right]^2 \Leftarrow \mbox{added } \pm X_i \hat \beta \mbox{ is effectively adding zero}\\ (expanding~the~terms)& = \sum_{i=1}^n (Y_i - X_i \hat \beta)^2 + 2 \sum_{i=1}^n (Y_i - X_i \hat \beta)(X_i \hat \beta - X_i \beta) + \sum_{i=1}^n (X_i \hat \beta - X_i \beta)^2 \\ \sum_{i=1}^n (Y_i - X_i \beta)^2 & \geq \sum_{i=1}^n (Y_i - X_i \hat \beta)^2 + 2 \sum_{i=1}^n (Y_i - X_i \hat \beta)(X_i \hat \beta - X_i \beta) \Leftarrow \sum_{i=1}^n (X_i \hat \beta - X_i \beta)^2 \mbox{ is always positive}\\ & (ignoring~the~second~term~for~now,~for~\hat \beta ~to~be~the~minimizer~of~the~squares, \\ & the~following~must~be~true)\\ \sum_{i=1}^n (Y_i - X_i \beta)^2 & \geq \sum_{i=1}^n (Y_i - X_i \hat \beta)^2 \Leftarrow \hat \beta \mbox{ would have to be the minimizer, because every other } \beta \mbox{ value }\\ & \mbox{creates a least square criteria that is at least as large or larger}\\ & \Rightarrow 2 \sum_{i=1}^n (Y_i - X_i \hat \beta)(X_i \hat \beta - X_i \beta) = 0 \Leftarrow \mbox{if we can make this term zero, then we've found our estimate}\\ (simplifying)& \Rightarrow \sum_{i=1}^n (Y_i - X_i \hat \beta) X_i (\hat \beta - \beta) = 0 \Leftarrow (\hat \beta - \beta) \mbox{ does not depend on }i\\ (simplifying)& \Rightarrow \sum_{i=1}^n (Y_i - X_i \hat \beta) X_i = 0 \\ (solving~for~\hat \beta) & \Rightarrow \hat \beta = \frac{\sum_{i=1}^n Y_i X_i}{\sum_{i=1}^n X_i^2}\\ \end{aligned} \]

  • Example
    • Let \(X_1, X_2, \ldots , X_n = 1\) \[ \begin{aligned} \sum_{i=1}^n (Y_i - X_i \beta)^2 = \sum_{i=1}^n (Y_i - \beta)^2 \\ \Rightarrow \hat \beta = \frac{\sum_{i=1}^n Y_i X_i}{\sum_{i=1}^n X_i^2} = \frac{\sum_{i=1}^n Y_i}{\sum_{i=1}^n 1} = \frac{\sum_{i=1}^n Y_i}{n} = \bar Y \end{aligned} \]
    • Note: this is the result from our previous derivation for least squares = empirical mean
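The closed-form solution \(\hat \beta = \sum_{i=1}^n Y_i X_i / \sum_{i=1}^n X_i^2\) can be checked against lm with the intercept suppressed (a sketch on simulated data):

```r
set.seed(4)
x <- rnorm(40)
y <- 0.7 * x + rnorm(40, sd = 0.3)
# closed form for regression through the origin
beta_manual <- sum(y * x) / sum(x^2)
# "- 1" in the formula removes the intercept
beta_lm <- unname(coef(lm(y ~ x - 1)))
c(manual = beta_manual, lm = beta_lm)  # identical
```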

5 Finding the Best Fit Line (Ordinary Least Squares)

5.1 Least Squares Model Fit

  • The least squares model fit to the line \(Y = \beta_0 + \beta_1 X\) through the data pairs \((X_i, Y_i)\) with \(Y_i\) as the outcome obtains the line \(Y = \hat \beta_0 + \hat \beta_1 X\) where \[\hat \beta_1 = Cor(Y, X) \frac{Sd(Y)}{Sd(X)} ~~~ \hat \beta_0 = \bar Y - \hat \beta_1 \bar X\]
    • [slope] \(\hat \beta_1\) has the units of \(Y / X\)
      • \(Cor(Y, X)\) = unit-less
      • \(Sd(Y)\) = has units of \(Y\)
      • \(Sd(X)\) = has units of \(X\)
    • [intercept] \(\hat \beta_0\) has the units of \(Y\)
    • the line passes through the point \((\bar X, \bar Y\))
      • this is evident from equation for \(\beta_0\) (rearrange equation)
  • The slope of the regression line with \(X\) as the outcome and \(Y\) as the predictor is \(Cor(Y, X) Sd(X)/ Sd(Y)\)
  • The slope is the same one you would get if you centered the data, \((X_i - \bar X, Y_i - \bar Y)\), and did a regression through the origin
  • If you normalized the data, \(\{ \frac{X_i - \bar X}{Sd(X)}, \frac{Y_i - \bar Y}{Sd(Y)}\}\), the slope is \(Cor(Y, X)\)

5.2 Derivation for \(\beta_0\) and \(\beta_1\)

  • Let \(Y = \beta_0 + \beta_1 X\), and \(\hat \beta_0\), \(\hat \beta_1\) be the estimates of \(\beta_0\), \(\beta_1\), the intercept and slope of the least square regression line, respectively \[ \begin{aligned} we~want~to~minimize~\sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2 & = \sum_{i=1}^n (Y_i^* - \beta_0)^2 ~ ~ ~where~ Y_i^* = Y_i - \beta_1 X_i\\ \mbox{(solution from 4.1) } \sum_{i=1}^n (Y_i^* - \beta_0)^2 & \Rightarrow \hat \beta_0 = \frac{\sum_{i=1}^n Y_i^*}{n} = \frac{\sum_{i=1}^n Y_i - \beta_1 X_i}{n}\\ & \Rightarrow \hat \beta_0 = \frac{\sum_{i=1}^n Y_i}{n} - \beta_1 \frac{\sum_{i=1}^n X_i}{n}\\ & \Rightarrow \hat \beta_0 = \bar Y - \beta_1 \bar X\\ \Longrightarrow \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2 & = \sum_{i=1}^n \left[Y_i - (\bar Y - \beta_1 \bar X) - \beta_1 X_i \right]^2\\ & = \sum_{i=1}^n \left[(Y_i - \bar Y) - (X_i - \bar X)\beta_1 \right]^2\\ \mbox{(this is exactly regression through the origin)} & = \sum_{i=1}^n \left[\tilde Y_i - \tilde X_i \beta_1 \right]^2 ~~~ where~ \tilde Y_i = Y_i - \bar Y, \tilde X_i = X_i - \bar X\\ \mbox{(solution from 4.1)} & \Rightarrow \hat \beta_1 = \frac{\sum_{i=1}^n \tilde Y_i \tilde X_i}{\sum_{i=1}^n \tilde{X_i}^2} = \frac{\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2}\\ & \Rightarrow \hat \beta_1 = \frac{\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)/(n-1)}{\sum_{i=1}^n (X_i - \bar X)^2/(n-1)} = \frac{Cov(Y, X)}{Var(X)}\\ & \Rightarrow \hat \beta_1 = Cor(Y, X) \frac{Sd(Y)}{Sd(X)}\\ & \Rightarrow \hat \beta_0 = \bar Y - \hat \beta_1 \bar X\\ \end{aligned} \]

5.3 Examples and R Commands

  • \(\hat \beta_0\) and \(\hat \beta_1\) can be manually calculated through the above formulas
  • coef(lm(y ~ x)) = R command to run the least square regression model on the data with y as the outcome, and x as the regressor
    • coef() = returns the slope and intercept coefficients of the lm results
# outcome
y <- galton$child
# regressor
x <- galton$parent
# slope
beta1 <- cor(y, x) * sd(y) / sd(x)
# intercept
beta0 <- mean(y) - beta1 * mean(x)
# results are the same as using the lm command
results <- rbind("manual" = c(beta0, beta1), "lm(y ~ x)" = coef(lm(y ~ x)))
# set column names
colnames(results) <- c("intercept", "slope")
# print results
results
##           intercept     slope
## manual     23.94153 0.6462906
## lm(y ~ x)  23.94153 0.6462906
  • slope of the best fit line = slope of best fit line through the origin for centered data
  • lm(y ~ x - 1) = forces a regression line to go through the origin (0, 0)
### Regression through the origin yields an equivalent slope if you center the data first
# centering y
yc <- y - mean(y)
# centering x
xc <- x - mean(x)
# slope
beta1 <- sum(yc * xc) / sum(xc ^ 2)
# results are the same as using the lm command
results <- rbind("centered data (manual)" = beta1, "lm(y ~ x)" = coef(lm(y ~ x))[2],
"lm(yc ~ xc - 1)" = coef(lm(yc ~ xc - 1))[1])
# set column names
colnames(results) <- c("slope")
# print results
results
##                            slope
## centered data (manual) 0.6462906
## lm(y ~ x)              0.6462906
## lm(yc ~ xc - 1)        0.6462906
  • slope of best fit line for normalized data = \(Cor(Y, X)\)
### Normalizing variables results in the slope being the correlation
# normalize y
yn <- (y - mean(y))/sd(y)
# normalize x
xn <- (x - mean(x))/sd(x)
# compare correlations
results <- rbind("cor(y, x)" = cor(y, x), "cor(yn, xn)" = cor(yn, xn),
"slope" = coef(lm(yn ~ xn))[2])
# print results
results
##                    xn
## cor(y, x)   0.4587624
## cor(yn, xn) 0.4587624
## slope       0.4587624
  • geom_smooth(method = "lm", formula = y~x) function in ggplot2 = adds regression line and confidence interval to graph
    • formula = y~x = default for the line (argument can be eliminated if y~x produces the line you want)
# constructs table for different combination of parent-child height
freqData <- as.data.frame(table(galton$child, galton$parent))
names(freqData) <- c("child", "parent", "freq")
# convert to numeric values
freqData$child <- as.numeric(as.character(freqData$child))
freqData$parent <- as.numeric(as.character(freqData$parent))
g <- ggplot(filter(freqData, freq > 0), aes(x = parent, y = child))
g <- g + scale_size(range = c(1, 10), guide = "none" )
g <- g + geom_point(colour="grey50", aes(size = freq+10))
g <- g + geom_point(aes(colour=freq, size = freq))
g <- g + scale_colour_gradient(low = "lightblue", high="white")
g <- g + geom_smooth(method="lm", formula=y~x)
g

6 Regression to the Mean

6.1 Galton’s Investigation on Regression to the Mean

  • both \(X\), the fathers’ heights, and \(Y\), the sons’ heights, are normalized so that they have mean of 0 and variance of 1
  • regression lines must pass through \((\bar X, \bar Y)\), or \((0, 0)\) in this case
  • slope of regression line = \(Cor(Y,X)\) regardless of which variable is the outcome or regressor (because standard deviations of both variables = 1)
  • Notice if \(X\) is the outcome and you create a plot where \(X\) is the horizontal axis, the slope of the least squares line that you plot is \(1/Cor(Y, X)\).
    • Note: for both regression lines to be plotted on the same graph, the second line’s slope must be \(1/Cor(Y, X)\) because the two relationships have flipped axes
# load father.son data
data(father.son)
# normalize son's height
y <- (father.son$sheight - mean(father.son$sheight)) / sd(father.son$sheight)
# normalize father's height
x <- (father.son$fheight - mean(father.son$fheight)) / sd(father.son$fheight)
# calculate correlation
rho <- cor(x, y)
# plot the relationship between the two
g = ggplot(data.frame(x = x, y = y), aes(x = x, y = y))
g = g + geom_point(size = 3, alpha = .2, colour = "black")
g = g + geom_point(size = 2, alpha = .2, colour = "salmon")
g = g + xlim(-4, 4) + ylim(-4, 4)
# reference line for perfect correlation between
# variables (data points will lie on the diagonal line)
g = g + geom_abline()
# if there is no correlation between the two variables,
# the data points will lie on horizontal/vertical lines
g = g + geom_vline(xintercept = 0)
g = g + geom_hline(yintercept = 0)
# plot the actual correlation for both
g = g + geom_abline(intercept = 0, slope = rho, size = 2)
g = g + geom_abline(intercept = 0, slope = 1 / rho, size = 2)
# add appropriate labels
g = g + xlab("Father's height, normalized")
g = g + ylab("Son's height, normalized")
g = g + geom_text(x = 3.5, y = 1.5, label="son~father", angle = 25) +
geom_text(x = 3.2, y = 3.6, label="cor(y,x)=1", angle = 35) +
geom_text(x = 1.6, y = 3.7, label="father~son", angle = 60)
g

  • suppose \(x\) (father’s normalized height) is 2; multiplying it by the correlation (which is the regression slope for the normalized data) gives the predicted son’s normalized height, \(2\rho\)
    • how much the regression line is shrunken towards the horizontal line (away from the line of perfect correlation) shows the extent of the regression to the mean
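The shrinkage can be sketched numerically (simulated standardized father/son-style pairs with a chosen correlation of 0.5, not the actual father.son data):

```r
set.seed(5)
n <- 1000
rho <- 0.5
x <- rnorm(n)                                  # "father", standardized
y <- rho * x + rnorm(n, sd = sqrt(1 - rho^2))  # "son", correlation ~ rho
fit <- lm(y ~ x)
# a father 2 sd above the mean predicts a son only about 2 * rho sd above it
predict(fit, newdata = data.frame(x = 2))
```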

7 Statistical Linear Regression Models

7.1 Interpreting Regression Coefficients

  • for the linear regression line \[Y_i = \beta_0 + \beta_1 X_i + \epsilon_{i}\] MLE for \(\beta_0\) and \(\beta_1\) are as follows \[\hat \beta_1 = Cor(Y, X)\frac{Sd(Y)}{Sd(X)} ~~~ \hat \beta_0 = \bar Y - \hat \beta_1 \bar X\]
  • \(\beta_0\) = expected value of the outcome/response when the predictor is 0 \[ E[Y | X = 0] = \beta_0 + \beta_1 \times 0 = \beta_0 \]
    • Note: \(X=0\) may not always be of interest as it may be impossible/outside of data range (i.e blood pressure, height etc.)
    • it may be useful to move the intercept at times \[ \begin{aligned} Y_i &= \beta_0 + \beta_1 X_i + \epsilon_i\\ &= \beta_0 + a \beta_1 + \beta_1 (X_i - a) + \epsilon_i \\ &= \tilde \beta_0 + \beta_1 (X_i - a) + \epsilon_i ~ ~ ~ ~ \mbox{where } \tilde \beta_0 = \beta_0 + a \beta_1\\ \end{aligned} \]

    • Note: shifting \(X\) values by value \(a\) changes the intercept, but not the slope
    • often, \(a\) is set to \(\bar X\) so that the intercept is interpreted as the expected response at the average \(X\) value

  • \(\beta_1\) = expected change in outcome/response for a 1 unit change in the predictor\[ E[Y ~|~ X = x+1] - E[Y ~|~ X = x] = \beta_0 + \beta_1 (x + 1) - (\beta_0 + \beta_1 x ) = \beta_1 \]
    • sometimes it is useful to change the units of \(X\) \[ \begin{aligned} Y_i & = \beta_0 + \beta_1 X_i + \epsilon_i \\ & = \beta_0 + \frac{\beta_1}{a} (X_i a) + \epsilon_i \\ & = \beta_0 + \tilde \beta_1 (X_i a) + \epsilon_i \\ \end{aligned} \]
    • multiplication of \(X\) by a factor \(a\) results in dividing the coefficient by a factor of \(a\)
    • Example:
      • \(X\) = height in \(m\)
      • \(Y\) = weight in \(kg\)
      • \(\beta_1\) has units of \(kg/m\)
      • converting \(X\) to \(cm\) \(\Longrightarrow\) multiplying \(X\) by \(100 \frac{cm}{m}\)
      • this means \(\beta_1\) has to be divided by \(100 \frac{cm}{m}\) to keep the units correct. \[ X~m \times 100\frac{cm}{m} = (100~X)~cm ~~\mbox{and}~~ \beta_1 \frac{kg}{m} \times \frac{1}{100}\frac{m}{cm} = \left(\frac{\beta_1}{100}\right)\frac{kg}{cm} \]
  • 95% confidence intervals for the coefficients can be constructed from the coefficients themselves and their standard errors (from summary(lm))
    • use the resulting intervals to evaluate the significance of the results
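Both facts above — shifting \(X\) changes only the intercept, and rescaling \(X\) rescales only the slope — can be checked with lm on simulated data (a sketch, not tied to any dataset in these notes):

```r
set.seed(6)
x <- rnorm(60, mean = 10)
y <- 2 * x + rnorm(60)
fit_raw     <- lm(y ~ x)
fit_shifted <- lm(y ~ I(x - mean(x)))  # a = xbar
fit_scaled  <- lm(y ~ I(x * 100))      # e.g. converting m -> cm
# slope unchanged by the shift
c(raw = coef(fit_raw)[[2]], shifted = coef(fit_shifted)[[2]])
# intercept after centering is the mean response (line passes through (xbar, ybar))
coef(fit_shifted)[[1]] - mean(y)             # ~ 0
# multiplying x by 100 divides the slope by 100
coef(fit_raw)[[2]] / coef(fit_scaled)[[2]]   # ~ 100
```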

7.2 Use Regression Coefficients for Prediction

  • for observed values of the predictor, \(X_1, X_2, \ldots , X_n\), the prediction of the outcome/response is as follows \[\hat \mu_i = \hat Y_i = \hat \beta_0 + \hat \beta_1 X_i\] where \(\mu_i\) describes a point on the regression line

7.3 Example and R Commands

  • diamond dataset from UsingR package
    • diamond prices in Singapore Dollars, diamond weight in carats (standard measure of diamond mass, 0.2g)
  • lm(price ~ I(carat - mean(carat)), data=diamond) = mean centered linear regression
    • Note: arithmetic operations must be enclosed in I() to work
  • predict(fitModel, newdata=data.frame(carat=c(0, 1, 2))) = returns predicted outcome from the given model (linear in our case) at the provided points within the newdata data frame
    • if newdata is unspecified (argument omitted), then predict function will return predicted values for all observed values of the predictor (x variable, carat in this case)
      • Note: newdata has to be a data frame, and the values at which you would like to predict (x variable, carat in this case) have to be specified; otherwise the system won’t know what to do with the provided values
  • summary(fitModel) = prints detailed summary of linear model
library(UsingR)
data(diamond)
# standard linear regression for price vs carat
fit <- lm(price ~ carat, data = diamond)
# intercept and slope
coef(fit)
## (Intercept)       carat 
##   -259.6259   3721.0249
# mean-centered regression
fit2 <- lm(price ~ I(carat - mean(carat)), data = diamond)
# intercept and slope
coef(fit2)
##            (Intercept) I(carat - mean(carat)) 
##               500.0833              3721.0249
# regression with more granular scale (1/10th carat)
fit3 <- lm(price ~ I(carat * 10), data = diamond)
# intercept and slope
coef(fit3)
##   (Intercept) I(carat * 10) 
##     -259.6259      372.1025
# predictions for 3 values
newx <- c(0.16, 0.27, 0.34)
# manual calculations
coef(fit)[1] + coef(fit)[2] * newx
## [1]  335.7381  745.0508 1005.5225
# prediction using the predict function
predict(fit, newdata = data.frame(carat = newx))
##         1         2         3 
##  335.7381  745.0508 1005.5225
  • interpretation
    • Standard linear regression
      • we estimate an expected 3721.02 (SIN) dollar increase in price for every carat increase in mass of diamond
      • the intercept -259.63 is the expected price of a 0 carat diamond (not very interesting)
    • Mean-centered regression
      • 500.08 (SIN) dollars is the expected price for the average sized diamond of the data (0.2042 carats)
    • Regression with more granular scale (1/10th carat)
      • 372.1 (SIN) dollar increase in price for every 1/10 carat increase in mass of diamond
  • prediction
    • for 0.16, 0.27, and 0.34 carats, we predict the prices to be 335.74, 745.05, 1005.52 (SIN) dollars
# plot the data points
plot(diamond$carat, diamond$price, xlab = "Mass (carats)", ylab = "Price (SIN $)",
bg = "lightblue", col = "black", cex = 1.1, pch = 21,frame = FALSE)
# plot linear regression line
abline(fit, lwd = 2)
# plot predictions for every value of carat (in red)
points(diamond$carat, predict(fit), pch = 19, col = "red")
# add guidelines for predictions for 0.16, 0.27, and 0.34
lines(c(0.16, 0.16, 0.12), c(200, coef(fit)[1] + coef(fit)[2] * 0.16,
coef(fit)[1] + coef(fit)[2] * 0.16))
lines(c(0.27, 0.27, 0.12), c(200, coef(fit)[1] + coef(fit)[2] * 0.27,
coef(fit)[1] + coef(fit)[2] * 0.27))
lines(c(0.34, 0.34, 0.12), c(200, coef(fit)[1] + coef(fit)[2] * 0.34,
coef(fit)[1] + coef(fit)[2] * 0.34))
# add text labels
text(newx+c(0.03, 0, 0), rep(250, 3), labels = newx, pos = 2)

7.4 Derivation for Maximum Likelihood Estimator

  • Note: this derivation is for the maximum likelihood estimator of the mean, \(\mu\), of a normal distribution as it is the basis of the linear regression model
  • linear regression model \[Y_i = \beta_0 + \beta_1 X_i + \epsilon_{i}\] follows a normal distribution because \(\epsilon_{i} \sim N(0, \sigma^2)\)
  • for the above model, \(E[Y_i] = \mu_i = \beta_0 + \beta_1 X_i\) and \(Var(Y_i) = \sigma^2\)

  • the probability density function (pdf) for an outcome \(y_i\) from the normal distribution is defined as \[f(y_i~|~\mu_i, \sigma^2) = (2\pi \sigma^2)^{-1/2}\exp \left(- \frac{1}{2\sigma^2}(y_i - \mu_i)^2\right)\]
  • the corresponding joint pdf for \(n\) independent normal outcomes \(y_1, \ldots, y_n\) is defined as \[f(y_1, \ldots, y_n~|~\mu, \sigma^2) = \prod_{i=1}^n (2\pi \sigma^2)^{-1/2}\exp \left(- \frac{1}{2\sigma^2}(y_i - \mu_i)^2\right)\] which, viewed as a function of the parameters, is the likelihood function, denoted in this case as \(\mathcal{L}(\mu, \sigma)\)
  • to find the maximum likelihood estimator (MLE) of the mean, \(\mu\), we take the derivative of the likelihood \(\mathcal{L}\) with respect to \(\mu\) \(\rightarrow\) \(\frac{\partial \mathcal{L}}{\partial \mu}\)
  • since derivatives of products are quite complex to compute, taking the \(\log\) (base \(e\)) makes the calculation much simpler
    • \(\log\) properties:
      • \(\log(AB) = \log(A) + \log(B)\)
      • \(\log(A^B) = B\log(A)\)
    • because \(\log\) is monotonically increasing (order-preserving), maximizing the likelihood is equivalent to maximizing the \(\log\) of the likelihood
  • -2 \(\log\) of likelihood function \[ \begin{aligned} \log(\mathcal{L}(\mu, \sigma)) & = \sum_{i=1}^n -\frac{1}{2}\log(2\pi \sigma^2) - \frac{1}{2\sigma^2}(y_i - \mu_i)^2 \Leftarrow \mbox{multiply -2 on both sides} \\ -2\log(\mathcal{L}(\mu, \sigma)) & = \sum_{i=1}^n \log(2\pi \sigma^2) + \frac{1}{\sigma^2}(y_i - \mu_i)^2 \Leftarrow \sigma^2 \mbox{ does not depend on }i \\ -2\log(\mathcal{L}(\mu, \sigma)) & = n\log(2\pi \sigma^2) + \frac{1}{\sigma^2}\sum_{i=1}^n (y_i - \mu_i)^2\\ \end{aligned} \]
  • minimizing -2 \(\log\) likelihood is computationally equivalent to maximizing the \(\log\) likelihood; differentiating with respect to each (unrestricted) \(\mu_i\) gives \[ \begin{aligned} \frac{\partial \log(\mathcal{L}(\mu, \sigma))}{\partial \mu_i} & = \frac{1}{\sigma^2} \frac{\partial \sum_{i=1}^n (y_i - \mu_i)^2}{\partial \mu_i} = 0 \Leftarrow \mbox{setting this equal to 0 to find the maximum}\\ \Rightarrow -\frac{2}{\sigma^2} (y_i - \mu_i) & = 0 \Leftarrow \mbox{each } \mu_i \mbox{ appears in exactly one term of the sum}\\ y_i & = \mu_i \\ \end{aligned} \]
  • in the case of our linear regression, \(\mu_i = \beta_0 + \beta_1 X_i\), so maximizing the likelihood \(\mathcal{L}(\beta_0, \beta_1, \sigma)\) is equivalent to minimizing the least squares criterion \[\sum_{i=1}^n \left(Y_i - \beta_0 - \beta_1 X_i\right)^2\] which means the least squares estimates \(\hat \beta_0\) and \(\hat \beta_1\) are also the maximum likelihood estimates

8 Residuals

8.1 Estimating Residual Variation

  • residual variation measures how well the regression line fits the data points
  • MLE of variance, \(\sigma^2\), of the linear model = \(\frac{1}{n}\sum_{i=1}^n e_i^2\) or the average squared residual/mean squared error
    • the square root of the estimate, \(\sigma\), = root mean squared error (RMSE)
  • however, a more common approach is to use \[\hat \sigma^2 = \frac{1}{n-2}\sum_{i=1}^n e_i^2\]
    • \(n-2\) is used instead of \(n\) to make the estimator unbiased \(\rightarrow\) \(E[\hat \sigma^2] = \sigma^2\)
    • Note: the -2 is accounting for the degrees of freedom for intercept and slope, which had to be estimated
  • deviance(fitModel) = calculates sum of the squared error/residual for the linear model/residual variation
  • summary(fitModel)$sigma = returns \(\hat \sigma\), the residual standard error of the fitted model (computed with the unbiased \(n-2\) denominator)
    • summary(fitModel) = creates a list of different parameters of the fit model
# get data
y <- diamond$price; x <- diamond$carat; n <- length(y)
# linear fit
fit <- lm(y ~ x)
# calculate residual standard error through summary and manually
rbind("from summary" = summary(fit)$sigma, "manual" =sqrt(sum(resid(fit)^2) / (n - 2)))
##                  [,1]
## from summary 31.84052
## manual       31.84052

8.2 Total Variation, \(R^2\), and Derivations

  • total variation = residual variation (variation after removing predictor, not explained by the model) + systematic/regression variation (variation explained by regression model) \[\sum_{i=1}^n (Y_i - \bar Y)^2 = \sum_{i=1}^n (Y_i - \hat Y_i)^2 + \sum_{i=1}^n (\hat Y_i - \bar Y)^2 \]
  • \(R^2\) = percent of total variability that is explained by the regression model \[ \begin{aligned} R^2 & = \frac{\mbox{regression variation}}{\mbox{total variation}} = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} \\ & = 1- \frac{\mbox{residual variation}}{\mbox{total variation}} = 1- \frac{\sum_{i=1}^n (Y_i - \hat Y_i)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2}\\ \end{aligned} \]
  • \(0 \leq R^2 \leq 1\)
  • \(R^2\) = sample correlation squared
    • cor(outcome, predictor) = calculates the correlation between predictor and outcome \(\rightarrow\) squaring this value yields \(R^2\)
  • \(R^2\) can be a misleading summary of model fit
    • deleting data \(\rightarrow\) can inflate \(R^2\)
    • adding terms to a regression model \(\rightarrow\) always increases \(R^2\)
    • data(anscombe); example(anscombe) demonstrates the fallacy of \(R^2\) through the following
      • basically same mean and variance of X and Y
      • identical correlations (hence same \(R^2\))
      • same linear regression relationship
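A minimal version of that demonstration, using only the anscombe data frame that ships with base R:

```r
data(anscombe)
# fit the four Anscombe regressions: y1 ~ x1, ..., y4 ~ x4
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), paste0("y", i)), data = anscombe)
})
# nearly identical slopes and R^2 despite very different point clouds
round(sapply(fits, function(f) coef(f)[[2]]), 2)              # all ~ 0.5
round(sapply(fits, function(f) summary(f)$r.squared), 2)      # all ~ 0.67
```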

  • relationship between \(R^2\) and \(r\) \[ \begin{aligned} \mbox{Correlation between X and Y} \Rightarrow r & = Cor(Y, X)\\ R^2 & = \frac{\sum_{i=1}^n (\hat Y_i - \bar Y)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} \\ &recall \Rightarrow (\hat Y_i - \bar Y) = \hat \beta_1 (X_i - \bar X)\\ (substituting~(\hat Y_i - \bar Y)) & = \hat \beta_1^2 \frac{\sum_{i=1}^n(X_i - \bar X)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2}\\ &recall \Rightarrow \hat \beta_1 = Cor(Y, X)\frac{Sd(Y)}{Sd(X)}\\ (substituting~\hat \beta_1) & = Cor(Y, X)^2\frac{Var(Y)}{Var(X)} \times \frac{\sum_{i=1}^n(X_i - \bar X)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2}\\ & recall ~ Var(Y) = \frac{\sum_{i=1}^n (Y_i - \bar Y)^2}{n-1} ~ and ~ Var(X) = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n-1}, ~ so~the~(n-1)~factors~cancel\\ (simplifying) \Rightarrow R^2 &= Cor(Y,X)^2\\ \mbox{Or } R^2 & = r^2\\ \end{aligned} \]

  • total variation derivation \[ \begin{aligned} \mbox{First, we know that } \bar Y & = \hat \beta_0 + \hat \beta_1 \bar X \\ (transforming) \Rightarrow \hat \beta_0 & = \bar Y - \hat \beta_1 \bar X \\ \\ \mbox{We also know that } \hat Y_i & = \hat \beta_0 + \hat \beta_1 X_i \\ \\ \mbox{Next, the residual } = (Y_i - \hat Y_i) & = Y_i - \hat \beta_0 - \hat \beta_1 X_i \\ (substituting~\hat \beta_0) & = Y_i - (\bar Y - \hat \beta_1 \bar X) - \hat \beta_1 X_i \\ (transforming) \Rightarrow (Y_i - \hat Y_i) & = (Y_i - \bar Y) - \hat \beta_1 (X_i - \bar X) \\ \\ \mbox{Next, the regression difference } = (\hat Y_i - \bar Y) & = \hat \beta_0 + \hat \beta_1 X_i - \bar Y \\ (substituting~\hat \beta_0) & = \bar Y - \hat \beta_1 \bar X + \hat \beta_1 X_i - \bar Y \\ (transforming) \Rightarrow (\hat Y_i - \bar Y) & = \hat \beta_1 (X_i - \bar X) \\ \\ \mbox{Total Variation} = \sum_{i=1}^n (Y_i - \bar Y)^2 & = \sum_{i=1}^n (Y_i - \hat Y_i + \hat Y_i - \bar Y)^2 \Leftarrow (adding~\pm \hat Y_i) \\ (expanding) & = \sum_{i=1}^n (Y_i - \hat Y_i)^2 + 2 \sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y) + \sum_{i=1}^n (\hat Y_i - \bar Y)^2 \\ \\ \mbox{Looking at } \sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y) & \\ (substituting~(Y_i - \hat Y_i),(\hat Y_i - \bar Y)) &= \sum_{i=1}^n \left[(Y_i - \bar Y) - \hat \beta_1 (X_i - \bar X)\right]\left[\hat \beta_1 (X_i - \bar X)\right]\\ (expanding) & = \hat \beta_1 \sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X) -\hat\beta_1^2\sum_{i=1}^n (X_i - \bar X)^2\\ & recall \Rightarrow \hat \beta_1 = \frac{\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2} \\ & \Rightarrow \sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X) = \hat \beta_1 \sum_{i=1}^n (X_i - \bar X)^2 \\ (substituting) & = \hat \beta_1^2 \sum_{i=1}^n (X_i - \bar X)^2-\hat\beta_1^2\sum_{i=1}^n (X_i - \bar X)^2\\ \Rightarrow \sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y) &= 0 \\ \\ \mbox{Going back to} \sum_{i=1}^n (Y_i - \bar Y)^2 &= \sum_{i=1}^n (Y_i - \hat Y_i)^2 + 2 \sum_{i=1}^n (Y_i - \hat Y_i)(\hat Y_i - \bar Y) + \sum_{i=1}^n (\hat Y_i - \bar Y)^2 \\ (since~second~term= 0) \Rightarrow \sum_{i=1}^n (Y_i - \bar Y)^2 &= \sum_{i=1}^n (Y_i - \hat Y_i)^2 + \sum_{i=1}^n (\hat Y_i - \bar Y)^2\\ \end{aligned} \]

8.3 Examples and R Commands

  • resid(fitModel) or fitModel$residuals = extracts model residuals from the fit model (lm in our case) \(\rightarrow\) list of residual values for every value of X
  • summary(fitModel)$r.squared = return \(R^2\) value of the regression model
# load multiplot function
source("./scripts/multiplot.R")
# get data
y <- diamond$price; x <- diamond$carat; n <- length(y)
# linear regression
fit <- lm(y ~ x)
# calculate residual
e <- resid(fit)
# calculate predicted values
yhat <- predict(fit)
# show that residuals are the differences between y and yhat
max(abs(e - (y - yhat)))
## [1] 9.485746e-13
# show that sum of residuals and
# sum of residuals times regressor equals 0
rbind("sum(e)" = sum(e), "sum(e*x)" = sum(e*x))
##                   [,1]
## sum(e)   -1.865175e-14
## sum(e*x)  6.959711e-15
# create 1 x 2 panel plot
par(mfrow=c(1, 2))
# plot residuals on regression line
plot(x, y, xlab = "Mass (carats)", ylab = "Price (SIN $)", bg = "lightblue",
col = "black", cex = 2, pch = 21,frame = FALSE,main = "Residual on Regression Line")
# draw linear regression line
abline(fit, lwd = 2)
# draw red lines from data points to regression line
for (i in 1 : n){lines(c(x[i], x[i]), c(y[i], yhat[i]), col = "red" , lwd = 2)}
# plot residual vs x
plot(x, e, xlab = "Mass (carats)", ylab = "Residuals (SIN $)", bg = "lightblue",
col = "black", cex = 2, pch = 21,frame = FALSE,main = "Residual vs X")
# draw horizontal line
abline(h = 0, lwd = 2)
# draw red lines from residual to x axis
for (i in 1 : n){lines(c(x[i], x[i]), c(e[i], 0), col = "red" , lwd = 2)}

  • Non-linear data/patterns can be more easily revealed through residual plots
# create sin wave pattern
x <- runif(100, -3, 3); y <- x + sin(x) + rnorm(100, sd = .2);
# plot data + regression
g <- ggplot(data.frame(x = x, y = y), aes(x = x, y = y)) +
geom_smooth(method = "lm", colour = "black") +
geom_point(size = 3, colour = "black", alpha = 0.4) +
geom_point(size = 2, colour = "red", alpha = 0.4)+
ggtitle("Residual on Regression Line")
# plot residuals
f <- ggplot(data.frame(x = x, y = resid(lm(y ~ x))), aes(x = x, y = y))+
geom_hline(yintercept = 0, size = 2)+
geom_point(size = 3, colour = "black", alpha = 0.4)+
geom_point(size = 2, colour = "red", alpha = 0.4)+
xlab("X") + ylab("Residual")+ ggtitle("Residual vs X")
multiplot(g, f, cols = 2)

  • Heteroskedasticity = the model’s error variance is not constant, but is instead a function of \(x\)
# create heteroskedastic data
x <- runif(100, 0, 6); y <- x + rnorm(100, mean = 0, sd = .001 * x)
# plot data + regression
g <- ggplot(data.frame(x = x, y = y), aes(x = x, y = y))+
geom_smooth(method = "lm", colour = "black")+
geom_point(size = 3, colour = "black", alpha = 0.4)+
geom_point(size = 2, colour = "red", alpha = 0.4) +
ggtitle("Residual on Regression Line")
# plot residuals
f <- ggplot(data.frame(x = x, y = resid(lm(y ~ x))), aes(x = x, y = y))+
geom_hline(yintercept = 0, size = 2) +
geom_point(size = 3, colour = "black", alpha = 0.4)+
geom_point(size = 2, colour = "red", alpha = 0.4)+
xlab("X") + ylab("Residual") + ggtitle("Residual vs X")
# combine two plots
multiplot(g, f, cols = 2)

  • Diamond Residual plots using ggplot
# plot residuals using ggplot
diamond$e <- resid(lm(price ~ carat, data = diamond))
g = ggplot(diamond, aes(x = carat, y = e))
g = g + xlab("Mass (carats)")
g = g + ylab("Residual price (SIN $)")
g = g + geom_hline(yintercept = 0, size = 2)
g = g + geom_point(size = 3, colour = "black", alpha=0.5)
g = g + geom_point(size = 2, colour = "blue", alpha=0.2)
# Create 2 residual vectors:
# FIRST residual vector is just a fit with an intercept
# So the residuals are just the deviations around the average price
# SECOND vector is the residuals around the regression line
# with carats as the explanatory variable
e = c(resid(lm(price ~ 1, data = diamond)),
resid(lm(price ~ carat, data = diamond)))
fit = factor(c(rep("Itc", nrow(diamond)),
rep("Itc, slope", nrow(diamond))))
f = ggplot(data.frame(e = e, fit = fit), aes(y = e, x = fit, fill = fit))
f = f + geom_dotplot(binaxis = "y", size = 2, stackdir = "center", binwidth = 20)
f = f + xlab("Fitting approach")
f = f + ylab("Residual price")
multiplot(g, f, cols = 2)

  • Note: The intercept is really the coefficient of a special regressor which has the same value, 1, at every sample.
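A quick check of this note: adding an explicit column of 1s and suppressing R's automatic intercept reproduces the same coefficients (a sketch on simulated data):

```r
# the intercept is just the coefficient of a regressor that is always 1
set.seed(1)
x <- rnorm(50); y <- 2 + 3 * x + rnorm(50)
ones <- rep(1, length(y))
fit1 <- coef(lm(y ~ x))            # implicit intercept
fit2 <- coef(lm(y ~ ones + x - 1)) # explicit column of 1s, intercept removed
# both fits give identical coefficients
max(abs(unname(fit1) - unname(fit2)))
```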

9 Inference in Regression

9.1 Intervals/Tests for Coefficients

  • standard errors for coefficients \[\begin{aligned} Var(\hat \beta_1) & = Var\left(\frac{\sum_{i=1}^n (Y_i - \bar Y)(X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2}\right) \\ (expanding) & = Var\left(\frac{\sum_{i=1}^n Y_i (X_i - \bar X) - \bar Y \sum_{i=1}^n (X_i - \bar X)}{\sum_{i=1}^n (X_i - \bar X)^2}\right) \\ & Since~ \sum_{i=1}^n (X_i - \bar X) = 0 \\ (simplifying) & = \frac{Var \left(\sum_{i=1}^n Y_i (X_i - \bar X) \right)}{(\sum_{i=1}^n (X_i - \bar X)^2)^2} \Leftarrow \mbox{denominator taken out of } Var\\ (Var(Y_i) = \sigma^2) & = \frac{\sigma^2 \sum_{i=1}^n (X_i - \bar X)^2}{(\sum_{i=1}^n (X_i - \bar X)^2)^2} \\ \sigma_{\hat \beta_1}^2 = Var(\hat \beta_1) &= \frac{\sigma^2 }{ \sum_{i=1}^n (X_i - \bar X)^2 }\\ \Rightarrow \sigma_{\hat \beta_1} &= \frac{\sigma}{\sqrt{ \sum_{i=1}^n (X_i - \bar X)^2}} \\ \\ \mbox{by a similar derivation} \Rightarrow & \\ \sigma_{\hat \beta_0}^2 = Var(\hat \beta_0) & = \left(\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2 }\right)\sigma^2 \\ \Rightarrow \sigma_{\hat \beta_0} &= \sigma \sqrt{\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^n (X_i - \bar X)^2 }}\\ \end{aligned}\]

  • \(\sigma\) is unknown but its estimate is as follows \[\hat \sigma^2 = \frac{1}{n-2}\sum_{i=1}^n e_i^2\]
  • under iid Gaussian errors (assumed in linear regression with term \(\epsilon_i\)), statistics for \(\hat \beta_0\) and \(\hat \beta_1\) are as follows \[\frac{\hat \beta_j - \beta_j}{\hat \sigma_{\hat \beta_j}} ~ ~ ~where~j = 0, 1\]
    • these statistics follow \(t\) distribution with \(n-2\) degrees of freedom for small \(n\) and normal distribution for large \(n\)
  • summary(fitModel)$coefficients = returns table summarizing the estimate, standard error, t value and p value for the coefficients \(\beta_0\) and \(\beta_1\)
    • the below example reproduces the table summary(fitModel)$coefficients produces
  • Note: the variability in the slope, or \(Var(\hat \beta_1)\), is the smallest when the predictor values are spread into two clusters that are far apart from each other, since this maximizes the denominator \(\sum_{i=1}^n (X_i - \bar X)^2\)
    • when modeling linear relationships, it is still generally good practice to have many data points that evenly cover the entire range of the data, which keeps the denominator \(\sum_{i=1}^n (X_i - \bar X)^2\) large
    • this is so that the variance of the slope stays small and we can be more confident about the relationship
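A quick sketch of how the design of the \(X\) values affects \(\sum_{i=1}^n (X_i - \bar X)^2\), and hence the slope variance (three hypothetical designs, all covering the same range):

```r
# since se(beta1) = sigma / sqrt(sum((x - mean(x))^2)),
# larger spread in x means a smaller slope variance
ssx <- function(x) sum((x - mean(x))^2)
n <- 100
xClustered <- c(rep(0, n/2), rep(10, n/2)) # two clusters at the extremes
xEven <- seq(0, 10, length = n)            # evenly spread over the range
xMiddle <- 5 + runif(n, -.5, .5)           # bunched in the middle
# the clustered design maximizes the denominator, the middle design minimizes it
round(c(clustered = ssx(xClustered), even = ssx(xEven), middle = ssx(xMiddle)))
```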
  • Example

# getting data
y <- diamond$price; x <- diamond$carat; n <- length(y)
# calculate beta1
beta1 <- cor(y, x) * sd(y) / sd(x)
# calculate beta0
beta0 <- mean(y) - beta1 * mean(x)
# Gaussian regression error
e <- y - beta0 - beta1 * x
# unbiased estimate for variance
sigma <- sqrt(sum(e^2) / (n-2))
# sum(X_i - X Bar)^2
ssx <- sum((x - mean(x))^2)
# calculate standard errors
seBeta0 <- (1 / n + mean(x) ^ 2 / ssx) ^ .5 * sigma
seBeta1 <- sigma / sqrt(ssx)
# statistics for beta0 and beta1; testing for H0: beta0 = 0 and beta1 = 0
tBeta0 <- beta0 / seBeta0; tBeta1 <- beta1 / seBeta1
# calculating p-values for Ha: beta0 != 0 and beta0 != 0 (two sided)
pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
# store results into table
coefTable <- rbind(c(beta0, seBeta0, tBeta0, pBeta0), c(beta1, seBeta1, tBeta1, pBeta1))
colnames(coefTable) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
rownames(coefTable) <- c("(Intercept)", "x")
# print table
coefTable
##              Estimate Std. Error   t value      P(>|t|)
## (Intercept) -259.6259   17.31886 -14.99094 2.523271e-19
## x           3721.0249   81.78588  45.49715 6.751260e-40
# regression model and the generated table from lm (identical to above)
fit <- lm(y ~ x)
summary(fit)$coefficients
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -259.6259   17.31886 -14.99094 2.523271e-19
## x           3721.0249   81.78588  45.49715 6.751260e-40
# store results in matrix
sumCoef <- summary(fit)$coefficients
# print out confidence interval for beta0
sumCoef[1,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[1, 2]
## [1] -294.4870 -224.7649
# print out confidence interval for beta1 in 1/10 units
(sumCoef[2,1] + c(-1, 1) * qt(.975, df = fit$df) * sumCoef[2, 2]) / 10
## [1] 355.6398 388.5651
# using built-in function
confint(fit)
##                2.5 %    97.5 %
## (Intercept) -294.487 -224.7649
## x           3556.398 3885.6513
  • interpretation: With 95% confidence, we estimate that a 0.1 carat increase in diamond size results in a 355.6 to 388.6 increase in price in (Singapore) dollars.

9.2 Prediction Interval

  • estimated prediction, \(\hat y_0\), at point \(x_0\) is \[\hat y_0 = \hat \beta_0 + \hat \beta_1 x_0\]
  • we can construct two prediction intervals
    1. interval for the line at \(x_0\) \[ \begin{aligned} \mbox{Interval}: & \hat y_0 \pm t_{n-2, 1-\alpha/2} \times SE_{line} \\ \mbox{where } & \hat y_0 = \hat \beta_0 + \hat \beta_1 x_0 \\ \mbox{and } & SE_{line} = \hat \sigma\sqrt{\frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}\\ \end{aligned} \]
      • interval has varying width
      • interval is narrow as we are quite confident in the regression line
      • as \(n\) increases, the interval becomes narrower, which makes sense because as more data is collected, we are able to get a better regression line
        • Note: if we knew \(\beta_0\) and \(\beta_1\), this interval would have zero width
    2. interval for the predicted value, \(\hat y_0\), at \(x_0\) \[ \begin{aligned} \mbox{Interval}: & \hat y_0 \pm t_{n-2, 1-\alpha/2} \times SE_{\hat y_0} \\ \mbox{where } & \hat y_0 = \hat \beta_0 + \hat \beta_1 x_0 \\ \mbox{and } & SE_{\hat y_0} = \hat \sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^n (X_i - \bar X)^2}}\\ \end{aligned} \]
      • interval has varying width
      • the \(1\) part in the \(SE_{\hat y_0}\) formula represents the inherent variability in the data
        • no matter how good of a regression line we get, we still can not get rid of the variability in the data
        • Note: even if we know \(\beta_0\) and \(\beta_1\), the interval would still have width due to data variance
  • predict(fitModel, newdata, interval = "confidence") = returns a 3-column matrix with data for fit (regression line), lwr (lower bound of interval), and upr (upper bound of interval)
    • interval = "confidence" = returns interval for the line
    • interval = "prediction" = returns interval for the prediction
    • newdata = must be a new data frame containing the values at which you would like to predict
  • Example (ggplot2)
# create a sequence of values that we want to predict at
newx = data.frame(x = seq(min(x), max(x), length = 100))
# calculate values for both intervals
p1 = data.frame(predict(fit, newdata = newx, interval = ("confidence")))
p2 = data.frame(predict(fit, newdata = newx, interval = ("prediction")))
# add column for interval labels
p1$interval = "confidence"; p2$interval = "prediction"
# add column for the x values we want to predict
p1$x = newx$x; p2$x = newx$x
# combine the two dataframes
dat = rbind(p1, p2)
# change the name of the first column to y
names(dat)[1] = "y"
# plot the data
g <- ggplot(dat, aes(x = x, y = y))
g <- g + geom_ribbon(aes(ymin = lwr, ymax = upr, fill = interval), alpha = 0.2)
g <- g + geom_line()
g + geom_point(data = data.frame(x = x, y=y), aes(x = x, y = y), size = 4)

  • Example (base)
# plot the x and y values
plot(x,y,frame=FALSE,xlab="Carat",ylab="Dollars",pch=21,col="black",bg="lightblue",cex=2)
# add the fit line
abline(fit,lwd=2)
# create sequence of x values that we want to predict at
xVals<-seq(min(x),max(x),by=.01)
# calculate the predicted y values
yVals<-beta0+beta1*xVals
# calculate the standard errors for the interval for the line
se1<-sigma*sqrt(1/n+(xVals-mean(x))^2/ssx)
# calculate the standard errors for the interval for the predicted values
se2<-sigma*sqrt(1+1/n+(xVals-mean(x))^2/ssx)
# plot the upper and lower bounds of both intervals
lines(xVals,yVals+2*se1, col = "red"); lines(xVals,yVals-2*se1, col = "red")
lines(xVals,yVals+2*se2, col = "blue"); lines(xVals,yVals-2*se2, col = "blue")

  • Both intervals have varying widths.
    • Least width at the mean of the Xs.
  • We are quite confident in the regression line, so that interval is very narrow.
    • If we knew \(\beta_0\) and \(\beta_1\) this interval would have zero width.
  • The prediction interval must incorporate the variability in the data around the line.
    • Even if we knew \(\beta_0\) and \(\beta_1\) this interval would still have width.

10 Multivariate Regression

10.1 Derivation of Coefficients

  • we know for simple univariate regression through the origin \[E[Y_i]=X_{1i}\beta_1 \\ \hat \beta_1 = \frac{\sum_{i=1}^n X_i Y_i}{\sum_{i=1}^n X_i^2}\]
  • we want to minimize \[\sum_{i=1}^n (Y_i - X_{1i}\beta_1 - X_{2i}\beta_2 - \ldots - X_{pi}\beta_p)^2\]
  • we begin by looking at the two variable model where \[E[Y_i] = \mu_i = X_{1i}\beta_1 + X_{2i}\beta_2 \\ \hat \mu_i = X_{1i}\hat \beta_1 + X_{2i}\hat \beta_2\]
  • from our previous derivations, to minimize the sum of squares, the following has to be true \[\sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) = 0\]
  • plugging in \(\hat \mu_i\) and \(\mu_i\) \[ \begin{aligned} \sum_{i=1}^n (Y_i - \hat \mu_i) (\hat \mu_i - \mu_i) & = \sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - X_{2i}\hat \beta_2)(X_{1i}\hat \beta_1 + X_{2i}\hat \beta_2 - X_{1i}\beta_1 - X_{2i}\beta_2)\\ (simplifying) & = \sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - X_{2i}\hat \beta_2)\left[ X_{1i}(\hat \beta_1 - \beta_1) + X_{2i} (\hat \beta_2 - \beta_2)\right]\\ (expanding) & = \sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - X_{2i}\hat \beta_2) X_{1i}(\hat \beta_1 - \beta_1) + \sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - X_{2i}\hat \beta_2) X_{2i} (\hat \beta_2 - \beta_2)\\ \end{aligned}\]
  • for the entire expression to equal to zero, \[\begin{aligned} \sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - X_{2i}\hat \beta_2) X_{1i}(\hat \beta_1 - \beta_1) & = 0 \\ (since~\hat \beta_1,~\beta_1~don't~depend~on~i) \Rightarrow \sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - X_{2i}\hat \beta_2) X_{1i}& = 0 ~ ~ ~ ~ ~ ~\mbox{(1)}\\ \\ \sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - X_{2i}\hat \beta_2) X_{2i}(\hat \beta_2 - \beta_2) & = 0 \\ (since~\hat \beta_2,~\beta_2~don't~depend~on~i) \Rightarrow \sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - X_{2i}\hat \beta_2) X_{2i} & = 0 ~ ~ ~ ~ ~ ~\mbox{(2)}\\ \end{aligned} \]
  • we can hold \(\hat \beta_1\) fixed and solve (2) for \(\hat \beta_2\) \[ \begin{aligned} \sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1) X_{2i} - \sum_{i=1}^n X_{2i}^2 \hat \beta_2& = 0\\ \sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1) X_{2i} &= \sum_{i=1}^n X_{2i}^2 \hat \beta_2 \\ \hat \beta_2 & = \frac{\sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1) X_{2i}}{\sum_{i=1}^n X_{2i}^2} \\ \end{aligned} \]

  • plugging \(\hat \beta_2\) back into (1), we get \[ \begin{aligned} \sum_{i=1}^n \left[Y_i - X_{1i}\hat \beta_1 - \frac{\sum_{j=1}^n (Y_j - X_{1j}\hat \beta_1) X_{2j}}{\sum_{j=1}^n X_{2j}^2}X_{2i}\right] X_{1i} & = 0 \\ \sum_{i=1}^n \left[Y_i - X_{1i}\hat \beta_1 - \frac{\sum_{j=1}^n Y_jX_{2j}}{\sum_{j=1}^n X_{2j}^2}X_{2i} + \frac{\sum_{j=1}^n X_{1j} X_{2j}}{\sum_{j=1}^n X_{2j}^2}X_{2i}\hat \beta_1\right] X_{1i} & = 0 \\ \sum_{i=1}^n \left[\left(Y_i - \frac{\sum_{j=1}^n Y_jX_{2j}}{\sum_{j=1}^n X_{2j}^2}X_{2i}\right) - \hat \beta_1 \left(X_{1i} - \frac{\sum_{j=1}^n X_{1j} X_{2j}}{\sum_{j=1}^n X_{2j}^2}X_{2i}\right)\right] X_{1i} & = 0 \\ \Rightarrow \left(Y_i - \frac{\sum_{j=1}^n Y_jX_{2j}}{\sum_{j=1}^n X_{2j}^2}X_{2i}\right) & = \mbox{residual for } Y_i = X_{i2} \hat \beta_2 + \epsilon_i\\ \Rightarrow \left(X_{1i} - \frac{\sum_{j=1}^n X_{1j} X_{2j}}{\sum_{j=1}^n X_{2j}^2}X_{2i}\right) & = \mbox{residual for } X_{i1} = X_{i2} \hat \gamma + \epsilon_i\\ \end{aligned} \]
  • we can rewrite \[\sum_{i=1}^n \left[\left(Y_i - \frac{\sum_{j=1}^n Y_jX_{2j}}{\sum_{j=1}^n X_{2j}^2}X_{2i}\right) - \hat \beta_1 \left(X_{1i} - \frac{\sum_{j=1}^n X_{1j} X_{2j}}{\sum_{j=1}^n X_{2j}^2}X_{2i}\right)\right] X_{1i} = 0 ~ ~ ~ ~ ~ ~(3)\] as \[ \sum_{i=1}^n \left[ e_{i, Y|X_2} -\hat \beta_1 e_{i, X_1|X_2} \right] X_{1i}= 0\] where \[e_{i, a|b} = a_i - \frac{ \sum_{j=1}^n a_j b_j}{ \sum_{j=1}^n b_j^2}b_i\] which is interpreted as the residual when regressing \(b\) from \(a\) without an intercept

  • We get the solution \[\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2}X_{1i}} ~ ~ ~ ~ ~ ~(4)\]

  • But note that \[ \begin{aligned} \sum_{i=1}^n e_{i, X_1 | X_2}^2 & = \sum_{i=1}^n e_{i, X_1 | X_2}\left(X_{1i} - \frac{\sum_{j=1}^n X_{1j}X_{2j} }{ \sum_{j=1}^n X_{2j}^2 } X_{2i}\right) \\ & = \sum_{i=1}^n e_{i, X_1 | X_2}X_{1i} - \frac{\sum_{j=1}^n X_{1j}X_{2j} }{ \sum_{j=1}^n X_{2j}^2 } \sum_{i=1}^n e_{i, X_1 | X_2} X_{2i}\\ & (recall~that~ \sum_{i=1}^n e_{i}X_i = 0,~so~the~2^{nd}~term~is~0) \\ \Rightarrow \sum_{i=1}^n e_{i, X_1 | X_2}^2 & = \sum_{i=1}^n e_{i, X_1 | X_2}X_{1i}\\ \end{aligned} \]

  • plugging the above equation back into (4), we get \[\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2}^2}\]

  • general case
    • pick one regressor and replace all the other variables with the residuals of their regressions against that one \[\sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - \ldots - X_{pi}\hat \beta_p)X_k = 0\] for \(k = 1, \ldots, p\) yields \(p\) equations with \(p\) unknowns
    • holding \(\hat \beta_1, \ldots, \hat \beta_{p-1}\) constant, we get \[\hat \beta_p = \frac{ \sum_{i=1}^n (Y_i - X_{1i}\hat \beta_1 - \ldots - X_{p-1, i}\hat \beta_{p-1})X_{pi}}{\sum_{i=1}^n X_{pi}^2}\]
    • plugging \(\hat \beta_p\) back into the equation \[\sum_{i=1}^n (e_{i,Y|X_p} - e_{i,X_1|X_p}\hat \beta_1 - \ldots - e_{i,X_{p-1}|X_p}\hat \beta_{p-1} )X_k = 0\]
    • Note that \[X_k = e_{i,X_k|X_p} + \frac{\sum_{i=1}^n X_{ki}X_{pi}}{\sum_{i=1}^n X_{pi}^2}X_p\] and that \[\sum_{i=1}^n e_{i,X_j|X_p}X_{pi} = 0\] Thus \[\sum_{i=1}^n (e_{i,Y|X_p} - e_{i,X_1|X_p}\hat \beta_1 - \ldots - e_{i,X_{p-1}|X_p}\hat \beta_{p-1} )X_k = 0\] is equal to \[\sum_{i=1}^n (e_{i,Y|X_p} - e_{i,X_1|X_p}\hat \beta_1 - \ldots - e_{i,X_{p-1}|X_p}\hat \beta_{p-1} )e_{i, X_k|X_p} = 0\]
    • this procedure reduces \(p\) LS equations with \(p\) unknowns to \(p-1\) LS equations with \(p-1\) unknowns
      • every variable is replaced by its residual with \(X_p\)
      • process iterates until only \(Y\) and one variable remains
      • or more intuitively, we take residuals over the confounding variables and do regression through the origin
  • Example
    • for simple linear regression, \(Y_{i} = \beta_1 X_{1i} + \beta_2 X_{2i}\) where \(X_{2i} = 1\) is an intercept term
    • the residuals \[e_{i, Y | X_2} = Y_i - \frac{ \sum_{j=1}^n Y_j X_{2j}}{ \sum_{j=1}^n X_{2j}^2}X_{2i} = Y_i - \bar Y\]
      • Note: this is according to previous derivation of the slope of a regression line through the origin
    • the residuals \(e_{i, X_1 | X_2}= X_{1i} - \frac{\sum_{j=1}^n X_{1j}X_{2j}}{\sum_{j=1}^n X_{2j}^2}X_{2i} = X_{1i} - \bar X_1\)
      • Note: this is according to previous derivation of the slope of a regression line through the origin
    • Thus \[\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2}^2} = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2} = Cor(X, Y) \frac{Sd(Y)}{Sd(X)}\]
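This special case is easy to check numerically; a small sketch on simulated data:

```r
# recover the SLR slope via regression through the origin on residuals
set.seed(7)
n <- 100; x <- rnorm(n); y <- 1 + 2 * x + rnorm(n)
# regressing out the constant regressor (intercept) just centers the variables
ey <- resid(lm(y ~ 1)) # equals y - mean(y)
ex <- resid(lm(x ~ 1)) # equals x - mean(x)
# slope from the residuals matches the usual SLR slope
slopeResid <- sum(ey * ex) / sum(ex^2)
slopeSLR <- unname(coef(lm(y ~ x))[2])
c(slopeResid, slopeSLR, cor(y, x) * sd(y) / sd(x))
```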

10.2 Interpretation of Coefficients

  • from the derivation in the previous section, we have \[\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2}^2}\]
  • this is interpreted as the effect of \(X_1\) on \(Y\) after the linear effects of all other variables have been removed from both \(X_1\) and the response \(Y\) (i.e. holding everything else constant / adjusting for all other variables)

  • the expected response is as follows \[E[Y | X_1 = x_1, \ldots, X_p = x_p] = \sum_{k=1}^p x_{k} \beta_k\] so the expected change in the response through change in one variable is \[ \begin{aligned} & E[Y | X_1 = x_1 + 1, \ldots, X_p = x_p] - E[Y | X_1 = x_1, \ldots, X_p = x_p] \\ = & (x_1 + 1) \beta_1 + \sum_{k=2}^p x_{k} \beta_k - \sum_{k=1}^p x_{k} \beta_k \\ = & (x_1 + 1) \beta_1 + \sum_{k=2}^p x_{k} \beta_k - (x_1 \beta_1 + \sum_{k=2}^p x_{k} \beta_k) \\ = & \beta_1 \\ \end{aligned} \]
  • therefore, interpretation of a multivariate regression coefficient \(\rightarrow\) expected change in the response per unit change in the regressor, holding all of the other regressors fixed
  • all of the SLR properties/calculations extend to the general linear model
    • model = \(Y_i = \sum_{k=1}^p X_{ik} \beta_{k} + \epsilon_{i}\) where \(\epsilon_i \sim N(0, \sigma^2)\)
    • fitted response = \(\hat Y_i = \sum_{k=1}^p X_{ik} \hat \beta_{k}\)
    • residual = \(e_i = Y_i - \hat Y_i\)
    • variance estimate = \(\hat \sigma^2 = \frac{1}{n-p} \sum_{i=1}^n e_i ^2\)
    • predicted responses at new values, \(x_1, \ldots, x_p\) = plug \(x\) values into \(\sum_{k=1}^p x_{k} \hat \beta_{k}\)
    • standard errors of coefficients = \(\hat \sigma_{\hat \beta_k}\)
    • test/CI statistic = \(\frac{\hat \beta_k - \beta_k}{\hat \sigma_{\hat \beta_k}}\) follows a \(T\) distribution with \(n-p\) degrees of freedom
    • predicted/expected response intervals = calculated using standard errors of predicted responses of \(\hat Y_i\)
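These quantities can be checked against what lm reports; a minimal sketch on simulated data (the variable names are illustrative):

```r
# check the multivariate variance estimate against lm's reported sigma
set.seed(3)
n <- 100; x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
e <- resid(fit)
p <- 3 # number of coefficients: intercept + 2 slopes
# hand-computed variance estimate uses n - p degrees of freedom
sigmaHat <- sqrt(sum(e^2) / (n - p))
c(byHand = sigmaHat, fromLm = summary(fit)$sigma)
```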

10.3 Example: Linear Model with 2 Variables and Intercept

# simulate the data
n = 100; x = rnorm(n); x2 = rnorm(n); x3 = rnorm(n)
# equation = intercept + var1 + var2 + var3 + error
y = 1 + x + x2 + x3 + rnorm(n, sd = .1)
# residual of y regressed on var2 and var3
ey = resid(lm(y ~ x2 + x3))
# residual of x regressed on var2 and var3
ex = resid(lm(x ~ x2 + x3))
# estimate beta1 for var1
sum(ey * ex) / sum(ex ^ 2)
## [1] 0.9958997
# regression through the origin with var1 with var2, var3 effects removed
coef(lm(ey ~ ex - 1))
##        ex 
## 0.9958997
# regression for all three variables
coef(lm(y ~ x + x2 + x3))
## (Intercept)           x          x2          x3 
##   0.9926959   0.9958997   0.9818908   0.9904702

10.4 Example: Coefficients that Reverse Signs

  • Note: more information can be found at ?swiss
  • data set is composed of standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland at about 1888
  • data frame has 47 observations on 6 variables, each of which is in percent [0, 100]
    • Fertility = common standardized fertility measure \(\rightarrow\) outcome
    • Agriculture = % of males involved in agriculture as occupation
    • Examination = % draftees receiving highest mark on army examination
    • Education = % education beyond primary school for draftees
    • Catholic = % catholic vs protestant
    • Infant.Mortality = % of live births who live less than 1 year
  • [GGally package] ggpairs(data) = produces pair wise plot for the predictors similar to pairs in base package
# load dataset
require(datasets); data(swiss); require(GGally)
# produce pairwise plot using ggplot2
ggpairs(swiss, lower = list(continuous = wrap("smooth", method = "lm")))

# print coefficients of regression of fertility on all predictors
summary(lm(Fertility ~ . , data = swiss))$coefficients
##                    Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept)      66.9151817 10.70603759  6.250229 1.906051e-07
## Agriculture      -0.1721140  0.07030392 -2.448142 1.872715e-02
## Examination      -0.2580082  0.25387820 -1.016268 3.154617e-01
## Education        -0.8709401  0.18302860 -4.758492 2.430605e-05
## Catholic          0.1041153  0.03525785  2.952969 5.190079e-03
## Infant.Mortality  1.0770481  0.38171965  2.821568 7.335715e-03
  • interpretation for Agriculture coefficient
    • we expect a 0.17 decrease in standardized fertility for every 1% increase in the percentage of males involved in agriculture, holding the remaining variables constant
    • since the p-value is 0.0187272, the t-test for \(H_0: \beta_{Agri} = 0\) versus \(H_a: \beta_{Agri} \neq 0\) is significant
  • however, if we look at the unadjusted estimate (marginal regression) for the coefficient for Agriculture
# run marginal regression on Agriculture
summary(lm(Fertility ~ Agriculture, data = swiss))$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 60.3043752 4.25125562 14.185074 3.216304e-18
## Agriculture  0.1942017 0.07671176  2.531577 1.491720e-02
  • interpretation for Agriculture coefficient
    • we expect a 0.19 increase in standardized fertility for every 1% increase in the percentage of males involved in agriculture, with no other variables adjusted for
      • Note: the coefficient flipped signs
    • since the p-value is 0.0149172, the t-test for \(H_0: \beta_{Agri} = 0\) versus \(H_a: \beta_{Agri} \neq 0\) is significant
  • to see intuitively how a sign change is possible, we can look at the following simulated example
# simulate data
n <- 100; x2 <- 1 : n; x1 <- .01 * x2 + runif(n, -.1, .1); y = -x1 + x2 + rnorm(n, sd = .01)
# print coefficients
c("with x1" = summary(lm(y ~ x1))$coef[2,1],
"with x1 and x2" = summary(lm(y ~ x1 + x2))$coef[2,1])
##        with x1 with x1 and x2 
##     94.2393492     -0.9952328
# print p-values
c("with x1" = summary(lm(y ~ x1))$coef[2,4],
"with x1 and x2" = summary(lm(y ~ x1 + x2))$coef[2,4])
##        with x1 with x1 and x2 
##   2.296891e-70   5.009692e-76
# store all data in one data frame (ey and ex1 are residuals with respect to x2)
dat <- data.frame(y = y, x1 = x1, x2 = x2, ey = resid(lm(y ~ x2)), ex1 = resid(lm(x1 ~ x2)))
# plot y vs x1
g <- ggplot(dat, aes(y = y, x = x1, colour = x2)) +
geom_point(colour="grey50", size = 2) +
geom_smooth(method = lm, se = FALSE, colour = "black") + geom_point(size = 1.5) +
ggtitle("unadjusted = y vs x1")
# plot residual of y adjusted for x2 vs residual of x1 adjusted for x2
g2 <- ggplot(dat, aes(y = ey, x = ex1, colour = x2)) +
geom_point(colour="grey50", size = 2) +
geom_smooth(method = lm, se = FALSE, colour = "black") + geom_point(size = 1.5) +
ggtitle("adjusted = y, x1 residuals with x2 removed") + labs(x = "resid(x1~x2)",
y = "resid(y~x2)")
# combine plots
multiplot(g, g2, cols = 2)

  • as we can see from above, the correlation between y and x1 flips signs when adjusting for x2
  • this effectively means that within each consecutive group/subset of points (each color gradient) on the left-hand (unadjusted) plot, the relationship between the points is negative, even though the overall trend is upward

  • going back to the swiss data set, the sign of the coefficient for Agriculture reverses itself with the inclusion of Examination and Education (both are negatively correlated with Agriculture)
    • correlation between Agriculture and Education: -0.64
    • correlation between Agriculture and Examination: -0.69
    • correlation between Education and Examination: 0.7
      • this means that the two variables are likely to be measuring the same things
  • Note: it is difficult to interpret and determine which one is the correct model \(\rightarrow\) one should not claim positive correlation between Agriculture and Fertility simply based on marginal regression lm(Fertility ~ Agriculture, data=swiss)
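The correlations cited above can be reproduced directly from the swiss data:

```r
# pairwise correlations among the three predictors discussed above
data(swiss)
round(cor(swiss[, c("Agriculture", "Examination", "Education")]), 2)
```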

10.5 Example: Unnecessary Variables

  • unnecessary predictors = variables that don’t provide any new linear information, meaning that the variables are simply linear combinations (multiples, sums) of other predictors/variables
  • when running a linear regression with unnecessary variables, R automatically drops the linear combinations and returns NA as their coefficients
# add a linear combination of agriculture and education variables
z <- swiss$Agriculture + swiss$Education
# run linear regression with unnecessary variables
lm(Fertility ~ . + z, data = swiss)$coef
##      (Intercept)      Agriculture      Examination        Education 
##       66.9151817       -0.1721140       -0.2580082       -0.8709401 
##         Catholic Infant.Mortality                z 
##        0.1041153        1.0770481               NA
  • as we can see above, R dropped the unnecessary variable z by excluding it from the linear regression \(\rightarrow\) z’s coefficient is NA

11 Dummy Variables

11.1 More Than 2 Levels

  • for 3 factor levels, we would need 2 dummy variables and the model would be \[Y_i = \beta_0 + X_{i1} \beta_1 + X_{i2} \beta_2 + \epsilon_i\]
  • for this example, we will use the above model to analyze US political party affiliations (Democrats vs Republicans vs independents) and denote the variables as follows:
    • \(X_{i1} = 1\) for Republicans and \(0\) otherwise
    • \(X_{i2} = 1\) for Democrats and \(0\) otherwise
    • If \(i\) is Republican, \(X_{i1} = 1\), \(X_{i2} = 0\), \(E[Y_i] = \beta_0 +\beta_1\)
    • If \(i\) is Democrat, \(X_{i1} = 0\), \(X_{i2} = 1\), \(E[Y_i] = \beta_0 + \beta_2\)
    • If \(i\) is independent, \(X_{i1} = 0\), \(X_{i2} = 0\), \(E[Y_i] = \beta_0\)
    • \(\beta_1\) compares Republicans to independents
    • \(\beta_2\) compares Democrats to independents
    • \(\beta_1 - \beta_2\) compares Republicans to Democrats
  • Note: choice of reference category (independent in this case) changes the interpretation
    • reference category = the group whose binary variable has been eliminated
  • the same principles explained above can be expanded to \(p\) level model \[Y_i = \beta_0 + X_{i1} \beta_1 + X_{i2} \beta_2 + \ldots + X_{ip} \beta_p + \epsilon_i\]
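In R, the reference category is whichever level of the factor comes first; relevel changes it, and with it the interpretation of the coefficients. A small sketch using hypothetical, simulated affiliation data:

```r
# simulated outcome with hypothetical group means 1 (Rep), 2 (Dem), 0 (ind)
set.seed(10)
party <- factor(rep(c("Republican", "Democrat", "independent"), each = 20),
                levels = c("Republican", "Democrat", "independent"))
y <- rep(c(1, 2, 0), each = 20) + rnorm(60)
# default reference category is the first level (Republican here)
coef(lm(y ~ party))
# relevel so that independents are the reference, matching the coding above
party <- relevel(party, ref = "independent")
# now the intercept is the independent mean, and the coefficients
# compare Republicans and Democrats to independents
coef(lm(y ~ party))
```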

11.2 Example: 6 Factor Level Insect Spray Data

  • below is a violin plot of the 6 different types (A, B, C, D, E, and F) of insect sprays and their potency (kill count) from InsectSprays data set
    • Note: the varying width of each violin indicates the density of measurements at each value
# load insect spray data
data(InsectSprays)
ggplot(data = InsectSprays, aes(y = count, x = spray, fill = spray)) +
geom_violin(colour = "black", size = 2) + xlab("Type of spray") +
ylab("Insect count")

Linear model fit with group A as reference category

# linear fit with 5 dummy variables
fit <- summary(lm(count ~ spray, data = InsectSprays))$coefficients; fit
##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)  14.5000000   1.132156 12.8074279 1.470512e-19
## sprayB        0.8333333   1.601110  0.5204724 6.044761e-01
## sprayC      -12.4166667   1.601110 -7.7550382 7.266893e-11
## sprayD       -9.5833333   1.601110 -5.9854322 9.816910e-08
## sprayE      -11.0000000   1.601110 -6.8702352 2.753922e-09
## sprayF        2.1666667   1.601110  1.3532281 1.805998e-01
  • Note: R automatically converts factor variables into \(n-1\) dummy variables and uses the first category as reference
    • mean of group A is therefore the default intercept
  • the above coefficients can be interpreted as the difference in means between each group (B, C, D, E, and F) and group A (the intercept)
    • example: the mean of group B is 0.83 higher than the mean of group A, which is 14.5
    • means for group B/C/D/E/F = the intercept + their respective coefficient
    • 0.83 minus -12.42 compares group B with group C
  • all t-tests are for comparisons of Sprays versus Spray A

Hard-coding the dummy variables

  • this produces the exact same result as the command lm(count ~ spray, data = InsectSprays)
# hard coding dummy variables
summary(lm(count ~ I(1 * (spray == 'B')) + I(1 * (spray == 'C')) +
I(1 * (spray == 'D')) + I(1 * (spray == 'E')) +
I(1 * (spray == 'F')), data = InsectSprays))$coefficients
##                          Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)            14.5000000   1.132156 12.8074279 1.470512e-19
## I(1 * (spray == "B"))   0.8333333   1.601110  0.5204724 6.044761e-01
## I(1 * (spray == "C")) -12.4166667   1.601110 -7.7550382 7.266893e-11
## I(1 * (spray == "D"))  -9.5833333   1.601110 -5.9854322 9.816910e-08
## I(1 * (spray == "E")) -11.0000000   1.601110 -6.8702352 2.753922e-09
## I(1 * (spray == "F"))   2.1666667   1.601110  1.3532281 1.805998e-01

Linear model fit with all 6 categories

# linear fit with 6 dummy variables
lm(count ~ I(1 * (spray == 'B')) + I(1 * (spray == 'C')) +
I(1 * (spray == 'D')) + I(1 * (spray == 'E')) +
I(1 * (spray == 'F')) + I(1 * (spray == 'A')),
data = InsectSprays)$coefficients
##           (Intercept) I(1 * (spray == "B")) I(1 * (spray == "C")) 
##            14.5000000             0.8333333           -12.4166667 
## I(1 * (spray == "D")) I(1 * (spray == "E")) I(1 * (spray == "F")) 
##            -9.5833333           -11.0000000             2.1666667 
## I(1 * (spray == "A")) 
##                    NA
  • as we can see from above, the coefficient for group A is NA
  • this is because \(X_{iA} = 1 - X_{iB} - X_{iC}- X_{iD}- X_{iE}- X_{iF}\), or the dummy variable for A is a linear combination of the rest of the dummy variables

Linear model fit with omitted intercept

  • eliminating the intercept would mean that each group is compared to the value 0, which would yield 6 variables since A is no longer the reference category
  • this means that the coefficients for the 6 variables are simply the mean of each group
    • when \(X_{iA} = 1\), all the other dummy variables become 0, which means the linear model becomes \[Y_i = \beta_A + \epsilon_i\]
    • then \(E[Y_i] = \beta_A = \mu_A\)
    • this makes sense because the best prediction for the kill count of spray of type A is the mean of recorded kill counts of A spray
# linear model with omitted intercept
nfit <- summary(lm(count ~ spray - 1, data = InsectSprays))$coefficients; nfit
##         Estimate Std. Error   t value     Pr(>|t|)
## sprayA 14.500000   1.132156 12.807428 1.470512e-19
## sprayB 15.333333   1.132156 13.543487 1.001994e-20
## sprayC  2.083333   1.132156  1.840148 7.024334e-02
## sprayD  4.916667   1.132156  4.342749 4.953047e-05
## sprayE  3.500000   1.132156  3.091448 2.916794e-03
## sprayF 16.666667   1.132156 14.721181 1.573471e-22
# actual means of count by each variable
round(tapply(InsectSprays$count, InsectSprays$spray, mean), 2)
##     A     B     C     D     E     F 
## 14.50 15.33  2.08  4.92  3.50 16.67
  • p-values are testing for whether the groups are different from zero (i.e. are the expected counts 0 for that spray?)
    • in contrast, in the model above where group A is the reference, the p-values test whether A differs from B, A differs from C, etc. \(\rightarrow\) having one group as a reference level is often preferable, since those between-group comparisons are usually the tests of interest
  • to compare between different categories, say B vs C, we can simply subtract the coefficients
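For instance, the B-vs-C comparison can be read off the no-intercept fit by subtracting coefficients (a sketch; note the standard error of this difference is not the one printed for either coefficient and must be computed separately, e.g. via a releveled fit):

```r
# no-intercept fit: coefficients are the group means
nfit <- lm(count ~ spray - 1, data = InsectSprays)
# difference in means, group B vs group C
coef(nfit)["sprayB"] - coef(nfit)["sprayC"]
```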

Reordering the levels

  • to reorient the model with other groups as reference categories, we can simply reorder the levels for the factor variable
    • relevel(var, "l") = reorders the factor levels within the factor variable var such that the specified level “l” is the reference/base/lowest level
      • Note: R automatically uses the first/base level as the reference category during a linear regression
# reorder the levels of spray variable such that C is the lowest level
spray2 <- relevel(InsectSprays$spray, "C")
# rerun linear regression with releveled factor
fit2 <- summary(lm(count ~ spray2, data = InsectSprays))$coef; fit2
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept)  2.083333   1.132156 1.840148 7.024334e-02
## spray2A     12.416667   1.601110 7.755038 7.266893e-11
## spray2B     13.250000   1.601110 8.275511 8.509776e-12
## spray2D      2.833333   1.601110 1.769606 8.141205e-02
## spray2E      1.416667   1.601110 0.884803 3.794750e-01
## spray2F     14.583333   1.601110 9.108266 2.794343e-13
  • it is important to note in this example that
    • counts are bounded from below by 0 \(\rightarrow\) violates the assumption of normality of the errors
    • there are counts near zero \(\rightarrow\) the normal approximation is poor near this boundary, so assuming normally distributed errors is hard to justify
    • variance does not appear to be constant across different type of groups (see: violin plot) \(\rightarrow\) violates assumption
      • taking log(counts + 1) may help (+1 since there are zero counts)
    • Poisson GLMs (covered in later material) are better suited for fitting count data and avoid these assumptions
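A minimal sketch of the Poisson alternative; the model uses a log link by default, so exponentiating the intercept recovers the expected count for the reference group:

```r
# Poisson GLM for the insect counts (log link by default)
pfit <- glm(count ~ spray, data = InsectSprays, family = poisson)
# exp(intercept) = fitted mean count for reference group A
exp(coef(pfit)[1])
```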

Calculating t-statistics manually

  • calculating the spray2B t value
# equivalently, from the original fit: (fit[2,1] - fit[3,1]) / 1.60111
fit2[3,1]/fit2[3,2]
## [1] 8.275511
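The matching p-value can also be recovered by hand from the t distribution; the degrees of freedom are \(n - p = 72 - 6 = 66\):

```r
# releveled fit with C as reference (as above)
fit2 <- summary(lm(count ~ relevel(spray, "C"), data = InsectSprays))$coef
t.stat <- fit2[3, 1] / fit2[3, 2]   # spray B vs reference C
# two-sided p-value with n - p = 66 degrees of freedom
2 * pt(-abs(t.stat), df = 66)
```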

12 Interactions

# load in hunger data
hunger <- read.csv("./data/hunger.csv")
# exclude the data with "Both Sexes" as values (only want Male vs Female)
hunger <- hunger[hunger$Sex!="Both sexes", ]
# structure of data
str(hunger)
## 'data.frame':    948 obs. of  12 variables:
##  $ Indicator     : Factor w/ 1 level "Children aged <5 years underweight (%)": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Data.Source   : Factor w/ 670 levels "NLIS_310005",..: 7 52 519 380 548 551 396 503 643 632 ...
##  $ PUBLISH.STATES: Factor w/ 1 level "Published": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year          : int  1986 1990 2005 2002 2008 2008 2003 2006 2012 1999 ...
##  $ WHO.region    : Factor w/ 6 levels "Africa","Americas",..: 1 2 2 3 1 1 1 2 1 2 ...
##  $ Country       : Factor w/ 151 levels "Afghanistan",..: 115 104 97 69 57 54 71 13 115 144 ...
##  $ Sex           : Factor w/ 3 levels "Both sexes","Female",..: 3 3 3 2 2 3 3 3 3 2 ...
##  $ Display.Value : num  19.3 2.2 5.3 3.2 17 15.7 19.3 4 15.5 4.2 ...
##  $ Numeric       : num  19.3 2.2 5.3 3.2 17 15.7 19.3 4 15.5 4.2 ...
##  $ Low           : logi  NA NA NA NA NA NA ...
##  $ High          : logi  NA NA NA NA NA NA ...
##  $ Comments      : logi  NA NA NA NA NA NA ...

12.1 Model: % Hungry ~ Year by Sex

  • this will include 2 models with 2 separate lines
  • model for % hungry (\(H_F\)) vs year (\(Y_F\)) for females is \[H_{Fi} = \beta_{F0} + \beta_{F1} Y_{Fi} + \epsilon_{Fi}\]
    • \(\beta_{F0}\) = % of females hungry at year 0
    • \(\beta_{F1}\) = decrease in % females hungry per year
    • \(\epsilon_{Fi}\) = error term (everything we didn’t measure)
  • model for % hungry (\(H_M\)) vs year (\(Y_M\)) for males is \[H_{Mi} = \beta_{M0} + \beta_{M1} Y_{Mi} + \epsilon_{Mi}\]
    • \(\beta_{M0}\) = % of males hungry at year 0
    • \(\beta_{M1}\) = decrease in % males hungry per year
    • \(\epsilon_{Mi}\) = error term (everything we didn’t measure)
  • each line has different residuals, standard errors, and variances
  • Note: \(\beta_{F0}\) and \(\beta_{M0}\) are the extrapolated intercepts at year 0, which do not hold any interpretable value for the model
    • it’s possible to subtract the model by a meaningful value (% hungry at 1970, or average), which moves the intercept of the lines to something interpretable
  • Note: we are also assuming the error terms \(\epsilon_{Fi}\) and \(\epsilon_{Mi}\) are Gaussian distributions \(\rightarrow\) mean = 0
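The recentering idea can be sketched with simulated data (illustrative numbers, not the hunger dataset): subtracting a reference year from the regressor leaves the slope unchanged and moves the intercept to the fitted value at that year:

```r
set.seed(1)
# simulated % hungry declining over time (illustrative values only)
year <- 1970:2010
pct  <- 30 - 0.3 * (year - 1970) + rnorm(length(year), sd = 2)
fit.raw      <- lm(pct ~ year)             # intercept = value at year 0
fit.centered <- lm(pct ~ I(year - 1970))   # intercept = value at 1970
# slopes agree; only the intercepts differ
c(coef(fit.raw)[2], coef(fit.centered)[2])
```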
# run linear model with Numeric vs Year for male and females
male.fit <- lm(Numeric ~ Year, data = hunger[hunger$Sex == "Male", ])
female.fit <- lm(Numeric ~ Year, data = hunger[hunger$Sex == "Female", ])
# plot % hungry vs the year
plot(Numeric ~ Year, data = hunger, pch = 19, col=(Sex=="Male")*1+1)
# plot regression lines for both
abline(male.fit, lwd = 3, col = "red")
abline(female.fit, lwd = 3, col = "black")

12.2 Model: % Hungry ~ Year + Sex (Binary Variable)

  • this will include 1 model with 2 separate lines with the same slope
  • model for % hungry (\(H\)) vs year (\(Y\)) and dummy variable for sex (\(X\)) is \[H_{i} = \beta_{0} + \beta_{1}X_i + \beta_{2} Y_{i} + \epsilon_{i}^*\]
    • \(X_i\) = binary variable, \(1\) if male, otherwise \(0\)
    • \(\beta_{0}\) = % of females hungry at year 0
    • \(\beta_{0} + \beta_{1}\) = % of males hungry at year 0
      • Note: the term \(\beta_{1}X_i\) is effectively an adjustment to the intercept for males and DOES NOT alter the slope in any way
      • \(\beta_{1}\) = difference in means of males vs females
    • \(\beta_{2}\) = decrease in % hungry (males and females) per year
      • this means that the slope is constant for both females and males
    • \(\epsilon_{i}^*\) = error term (everything we didn’t measure)
      • we are still assuming Gaussian error term
  • abline(intercept, slope) = adds a line to the existing plot based on the intercept and slope provided
    • abline(lm) = plots the linear regression line on the plot
# run linear model with Numeric vs Year and Sex
both.fit <- lm(Numeric ~ Year+Sex, data = hunger)
# print fit
both.fit$coef
## (Intercept)        Year     SexMale 
##  633.528289   -0.308397    1.902743
# plot % hungry vs the year
plot(Numeric ~ Year, data = hunger, pch = 19, col=(Sex=="Male")*1+1)
# plot regression lines for both (same slope)
abline(both.fit$coef[1], both.fit$coef[2], lwd = 3, col = "black")
abline(both.fit$coef[1]+both.fit$coef[3], both.fit$coef[2], lwd = 3, col = "red")

12.3 Model: % Hungry ~ Year + Sex + Year * Sex (Binary Interaction)

  • this will include 1 model with an interaction term with binary variable, which produces 2 lines with different slopes
  • we can introduce an interaction term to the previous model to capture the different slopes between males and females
  • model for % hungry (\(H\)) vs year (\(Y\)), sex (\(X\)), and interaction between year and sex (\(Y \times X\)) is \[H_{i} = \beta_{0} + \beta_{1}X_i + \beta_{2} Y_{i} + \beta_{3}X_i Y_{i} + \epsilon_{i}^+\]
    • \(X_i\) = binary variable, \(1\) if male, otherwise \(0\)
    • \(\beta_{0}\) = % of females hungry at year 0
    • \(\beta_{0} + \beta_{1}\) = % of males hungry at year 0
      • \(\beta_{1}\) = change in intercept for males
    • \(\beta_{2}\) = decrease in % hungry (females) per year
    • \(\beta_{2} + \beta_{3}\) = decrease in % hungry (males) per year
      • \(\beta_{3}\) = change in slope for males
    • \(\epsilon_{i}^+\) = error term (everything we didn’t measure)
  • expected value for males is \(E[H_i]_M = (\beta_0 + \beta_1) + (\beta_2 + \beta_3) Y_i\)
  • expected value for females is \(E[H_i]_F = \beta_0 + \beta_2 Y_i\)
    • \(\beta_1\) and \(\beta_3\) are effectively adjusting the intercept and slope for males
  • lm(outcome ~ var1*var2) = whenever an interaction is specified in lm function using the * operator, the individual terms are added automatically
    • lm(outcome ~ var1+var2+var1*var2) = builds the exact same model
    • lm(outcome ~ var1:var2) = builds linear model with only the interaction term (specified by : operator)
# run linear model with Numeric vs Year and Sex and interaction term
interaction.fit <- lm(Numeric ~ Year*Sex, data = hunger)
# print fit
interaction.fit$coef
##  (Intercept)         Year      SexMale Year:SexMale 
## 603.50579986  -0.29339638  61.94771998  -0.03000132
# plot % hungry vs the year
plot(Numeric ~ Year, data = hunger, pch = 19, col=(Sex=="Male")*1+1)
# plot regression lines for both (different slope)
abline(interaction.fit$coef[1], interaction.fit$coef[2], lwd = 3, col = "black")
abline(interaction.fit$coef[1]+interaction.fit$coef[3],
interaction.fit$coef[2]+interaction.fit$coef[4], lwd = 3, col = "red")

12.4 Example: % Hungry ~ Year + Income + Year * Income (Continuous Interaction)

  • this will include 1 model with an interaction term with continuous variable, which produces a curve through the plot
  • for continuous interactions (two continuous variables) with model \[Y_{i} = \beta_{0} + \beta_{1} X_{1i} + \beta_{2} X_{2i} + \beta_{3} X_{1i} X_{2i} + \epsilon_{i}\] the expected value for a given set of values \(x_1\) and \(x_2\) is defined as \[E[Y_i|X_{1i}=x_1, X_{2i}=x_2] = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{1} x_{2}\]
  • holding \(X_2\) constant and varying \(X_1\) by \(1\), we have \[\begin{aligned} \frac{\partial Y_i}{\partial X_{1i}} & = E[Y_i|X_{1i}=x_1 + 1, X_{2i}=x_2] - E[Y_i|X_{1i}=x_1, X_{2i}=x_2] \\ & = \beta_{0} + \beta_{1} (x_{1}+1) + \beta_{2} x_{2} + \beta_{3} (x_{1}+1) x_{2} - [\beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{1} x_{2}]\\ & = \beta_{1} + \beta_{3} x_{2}\\ \end{aligned}\]
    • Note: this means that slope for \(X_{1i}\) is not constant and is dependent on \(X_{2i}\)
    • \(\beta_1\) is the slope for \(X_{1i}\) when \(X_{2i}\) = 0
  • by the same logic, if we vary \(X_1\) by \(1\) and find the change, and vary \(X_2\) by \(1\) and find the change, we get \[\begin{aligned} \frac{\partial}{\partial X_{2i}} \left(\frac{\partial Y_i}{\partial X_{1i}}\right) & = E[Y_i|X_{1i}=x_1 + 1, X_{2i}=x_2+1] - E[Y_i|X_{1i}=x_1, X_{2i}=x_2+1]\\ & \qquad - \Big(E[Y_i|X_{1i}=x_1+1, X_{2i}=x_2] - E[Y_i|X_{1i}=x_1, X_{2i}=x_2]\Big)\\ & = \beta_{0} + \beta_{1} (x_{1}+1) + \beta_{2} (x_{2}+1) + \beta_{3} (x_{1}+1) (x_{2}+1) - [\beta_{0} + \beta_{1} x_{1} + \beta_{2} (x_{2}+1) + \beta_{3} x_{1} (x_{2}+1)]\\ & \qquad - \Big(\beta_{0} + \beta_{1} (x_{1}+1) + \beta_{2} x_{2} + \beta_{3} (x_{1}+1) x_{2} - [\beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{1} x_{2}]\Big)\\ & = \beta_{3}\\ \end{aligned}\]
    • this can be interpreted as \(\beta_3\) = the expected change in \(Y\) per unit change in \(X_1\) per unit change of \(X_2\)
    • in other words, \(\beta_3\) = the change in slope of \(X_1\) and \(Y\) per unit change of \(X_2\)
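This identity can be verified numerically on any fitted model: the change in the prediction per unit of \(X_1\) at a fixed \(x_2\) equals \(\hat\beta_1 + \hat\beta_3 x_2\). A self-contained sketch with simulated data:

```r
set.seed(10)
# simulate two continuous predictors with an interaction
x1 <- runif(200); x2 <- runif(200)
y  <- 1 + 2 * x1 - 3 * x2 + 5 * x1 * x2 + rnorm(200, sd = .1)
fit <- lm(y ~ x1 * x2)
b <- coef(fit)
# change in prediction per unit of x1, holding x2 = 0.5
d1 <- predict(fit, data.frame(x1 = 1, x2 = .5)) -
  predict(fit, data.frame(x1 = 0, x2 = .5))
# identical to beta1 + beta3 * x2
c(d1, b["x1"] + b["x1:x2"] * .5)
```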
  • coming back to the hunger data, model for % hungry (\(H\)) vs year (\(Y\)), income (\(I\)), and interaction between year and income (\(Y \times I\)) is \[H_{i} = \beta_{0} + \beta_{1}I_i + \beta_{2} Y_{i} + \beta_{3}I_i Y_{i} + \epsilon_{i}^+\]
    • \(\beta_{0}\) = % hungry children (whose parents have no income) at year 0
    • \(\beta_{1}\) = change in % hungry children for each dollar in income in year 0
    • \(\beta_{2}\) = change in % hungry children (whose parents have no income) per year
    • \(\beta_{3}\) = change in % hungry children per year and for each dollar in income
      • if income is \(\$10,000\), then the change in % hungry children per year will be \(\beta_2 + 10000 \times \beta_3\)
    • \(\epsilon_{i}^+\) = error term (everything we didn’t measure)
  • Note: much care needs to be taken when interpreting these coefficients
# generate some income data
hunger$Income <- 1:nrow(hunger)*10 + 500*runif(nrow(hunger), 0, 10) +
runif(nrow(hunger), 0, 500)^1.5
# run linear model with Numeric vs Year and Income and interaction term
lm(Numeric ~ Year*Income, data = hunger)$coef
##   (Intercept)          Year        Income   Year:Income 
##  1.076723e+03 -5.290447e-01 -3.863526e-02  1.927627e-05

13 Multivariable Simulation

13.1 Simulation 1 - Treatment = Adjustment Effect

  • the following code simulates the linear model, \[Y_i = \beta_0 + \beta_1 x_i + \beta_2 t_i + \epsilon_i\] where \(t = \{0, 1\} \rightarrow\) binary variable
# simulate data
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(runif(n/2), runif(n/2));
# define parameters/coefficients
beta0 <- 0; beta1 <- 2; beta2 <- 1; sigma <- .2
# generate outcome using linear model
y <- beta0 + x * beta1 + t * beta2 + rnorm(n, sd = sigma)
# set up axes
plot(x, y, type = "n", frame = FALSE)
# plot linear fit of y vs x
abline(lm(y ~ x), lwd = 2, col = "blue")
# plot means of the two groups (t = 0 vs t = 1)
abline(h = mean(y[1 : (n/2)]), lwd = 3, col = "red")
abline(h = mean(y[(n/2 + 1) : n]), lwd = 3, col = "red")
# plot linear fit of y vs x and t
fit <- lm(y ~ x + t)
# plot the two lines corresponding to (t = 0 vs t = 1)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2], lwd = 3)
# add in the actual data points
points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", cex = 1)
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 1)

# print treatment and adjustment effects
rbind("Treatment Effect" = lm(y~t+x)$coef[2], "Adjustment Effect" = lm(y~t)$coef[2])
##                          t
## Treatment Effect  1.016810
## Adjustment Effect 1.024089
  • in the above graph, the elements are as follows:
    • \(red\) lines = means for two groups (\(t = 0\) vs \(t = 1\)) \(\rightarrow\) two lines representing lm(y ~ t)
    • \(black\) lines = regression lines for two groups (\(t = 0\) vs \(t = 1\)) \(\rightarrow\) two lines representing lm(y ~ t + x)
    • \(blue\) line = overall regression of \(y\) vs \(x\) \(\rightarrow\) line representing lm(y ~ x)
  • from the graph, we can see that
    • \(x\) variable is unrelated to group status \(t\)
      • distribution of each group (salmon vs light blue) of \(Y\) vs \(X\) is effectively the same
    • \(x\) variable is related to \(Y\), but the intercept depends on group status \(t\)
    • group variable \(t\) is related to \(Y\)
      • relationship between \(t\) and \(Y\) disregarding \(x \approx\) the same as holding \(x\) constant
      • difference in group means \(\approx\) difference in regression lines
      • treatment effect (difference in regression lines) \(\approx\) adjustment effect (difference in group means)

13.2 Simulation 2 - No Treatment Effect

  • the following code simulates the linear model, \[Y_i = \beta_0 + \beta_1 x_i + \beta_2 t_i + \epsilon_i\] where \(t = \{0, 1\} \rightarrow\) binary variable
  • in this case, \(\beta_2\) is set to \(0\)
# simulate data
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(runif(n/2), 1.5 + runif(n/2));
# define parameters/coefficients
beta0 <- 0; beta1 <- 2; beta2 <- 0; sigma <- .2
# generate outcome using linear model
y <- beta0 + x * beta1 + t * beta2 + rnorm(n, sd = sigma)
# set up axes
plot(x, y, type = "n", frame = FALSE)
# plot linear fit of y vs x
abline(lm(y ~ x), lwd = 2, col = "blue")
# plot means of the two groups (t = 0 vs t = 1)
abline(h = mean(y[1 : (n/2)]), lwd = 3, col = "red")
abline(h = mean(y[(n/2 + 1) : n]), lwd = 3, col = "red")
# plot linear fit of y vs x and t
fit <- lm(y ~ x + t)
# plot the two lines corresponding to (t = 0 vs t = 1)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2], lwd = 3)
# add in the actual data points
points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", cex = 1)
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 1)

# print treatment and adjustment effects
rbind("Treatment Effect" = lm(y~t+x)$coef[2], "Adjustment Effect" = lm(y~t)$coef[2])
##                           t
## Treatment Effect  0.1825682
## Adjustment Effect 3.1079122
  • in the above graph, the elements are as follows:
    • \(red\) lines = means for two groups (\(t = 0\) vs \(t = 1\)) \(\rightarrow\) two lines representing lm(y ~ t)
    • \(black\) lines = regression lines for two groups (\(t = 0\) vs \(t = 1\)) \(\rightarrow\) two lines representing lm(y ~ t + x)
      • in this case, both lines nearly coincide with lm(y ~ x) since the true coefficient of \(t\), \(\beta_2\), is \(0\)
    • \(blue\) line = overall regression of \(y\) vs \(x\) \(\rightarrow\) line representing lm(y ~ x)
      • this is overwritten by the \(black\) lines
  • from the graph, we can see that
    • \(x\) variable is highly related to group status \(t\)
      • clear shift in \(x\) with salmon vs light blue groups
    • \(x\) variable is related to \(Y\), but the intercept does not depend on group status \(t\)
      • intercepts for both lines are the same
    • \(x\) variable shows similar relationships to \(Y\) for both groups (\(t = 0\) vs \(t = 1\), or salmon vs lightblue)
      • the \(x\) values of the two groups of points both seem to be linearly correlated with \(Y\)
    • group variable \(t\) is marginally related to \(Y\) when disregarding X
      • \(x\) values capture most of the variation
      • adjustment effect (difference in group means) is very large
    • group variable \(t\) is unrelated or has very little effect on \(Y\)
      • treatment effect is very small or non-existent
      • Note: the groups (\(t = 0\) vs \(t = 1\)) are incomparable since there is no data to inform the relationship between \(t\) and \(Y\)
      • the groups (salmon vs lightblue) don’t have any overlaps so we have no idea how they behave
      • this conclusion is based on the constructed simulation alone

13.3 Simulation 3 - Treatment Reverses Adjustment Effect

  • the following code simulates the linear model, \[Y_i = \beta_0 + \beta_1 x_i + \beta_2 t_i + \epsilon_i\] where \(t = \{0, 1\} \rightarrow\) binary variable
  • in this case, \(\beta_0\) is set to \(0\) \(\rightarrow\) no intercept
# simulate data
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(runif(n/2), .9 + runif(n/2));
# define parameters/coefficients
beta0 <- 0; beta1 <- 2; beta2 <- -1; sigma <- .2
# generate outcome using linear model
y <- beta0 + x * beta1 + t * beta2 + rnorm(n, sd = sigma)
# set up axes
plot(x, y, type = "n", frame = FALSE)
# plot linear fit of y vs x
abline(lm(y ~ x), lwd = 2, col = "blue")
# plot means of the two groups (t = 0 vs t = 1)
abline(h = mean(y[1 : (n/2)]), lwd = 3, col = "red")
abline(h = mean(y[(n/2 + 1) : n]), lwd = 3, col = "red")
# plot linear fit of y vs x and t
fit <- lm(y ~ x + t)
# plot the two lines corresponding to (t = 0 vs t = 1)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2], lwd = 3)
# add in the actual data points
points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", cex = 1)
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 1)

# print treatment and adjustment effects
rbind("Treatment Effect" = lm(y~t+x)$coef[2], "Adjustment Effect" = lm(y~t)$coef[2])
##                            t
## Treatment Effect  -0.9992889
## Adjustment Effect  0.7716072
  • in the above graph, the elements are as follows:
    • \(red\) lines = means for two groups (\(t = 0\) vs \(t = 1\)) \(\rightarrow\) two lines representing lm(y ~ t)
    • \(black\) lines = regression lines for two groups (\(t = 0\) vs \(t = 1\)) \(\rightarrow\) two lines representing lm(y ~ t + x)
    • \(blue\) line = overall regression of \(y\) vs \(x\) \(\rightarrow\) line representing lm(y ~ x)
  • from the graph, we can see that
    • disregarding/adjusting for \(x\), the mean for salmon group is higher than the mean of the blue group (adjustment effect is positive)
    • when adding \(t\) into the linear model, the treatment actually reverses the orders of the group \(\rightarrow\) the intercept of the salmon group is lower than the intercept of the blue group (treatment effect is negative)
      • adjusted estimate is significant and the exact opposite of our unadjusted estimate
    • group variable \(t\) is related to \(x\)
    • some points overlap, so it is possible to compare subsets of the two groups holding \(x\) fixed

13.4 Simulation 4 - No Adjustment Effect

  • the following code simulates the linear model, \[Y_i = \beta_0 + \beta_1 x_i + \beta_2 t_i + \epsilon_i\] where \(t = \{0, 1\} \rightarrow\) binary variable
  • in this case, \(\beta_0\) is set to \(0\) \(\rightarrow\) no intercept
    • this simulation illustrates a case with no adjustment (marginal) effect
# simulate data
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(.5 + runif(n/2), runif(n/2));
# define parameters/coefficients
beta0 <- 0; beta1 <- 2; beta2 <- 1; sigma <- .2
# generate outcome using linear model
y <- beta0 + x * beta1 + t * beta2 + rnorm(n, sd = sigma)
# set up axes
plot(x, y, type = "n", frame = FALSE)
# plot linear fit of y vs x
abline(lm(y ~ x), lwd = 2, col = "blue")
# plot means of the two groups (t = 0 vs t = 1)
abline(h = mean(y[1 : (n/2)]), lwd = 3, col = "red")
abline(h = mean(y[(n/2 + 1) : n]), lwd = 3, col = "red")
# plot linear fit of y vs x and t
fit <- lm(y ~ x + t)
# plot the two lines corresponding to (t = 0 vs t = 1)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2], lwd = 3)
# add in the actual data points
points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", cex = 1)
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 1)

# print treatment and adjustment effects
rbind("Treatment Effect" = lm(y~t+x)$coef[2], "Adjustment Effect" = lm(y~t)$coef[2])
##                           t
## Treatment Effect  0.9199923
## Adjustment Effect 0.1756044
  • in the above graph, the elements are as follows:
    • \(red\) lines = means for two groups (\(t = 0\) vs \(t = 1\)) \(\rightarrow\) two lines representing lm(y ~ t)
    • \(black\) lines = regression lines for two groups (\(t = 0\) vs \(t = 1\)) \(\rightarrow\) two lines representing lm(y ~ t + x)
    • \(blue\) line = overall regression of \(y\) vs \(x\) \(\rightarrow\) line representing lm(y ~ x)
  • from the graph, we can see that
    • no clear relationship between group variable \(t\) and \(Y\)
      • two groups have similar distributions with respect to \(Y\)
    • treatment effect is substantial
      • separation of regression lines is large when adjusted for x
    • adjustment effect is effectively 0 as there are no large differences between the means of the groups
    • group variable \(t\) is not related to \(x\)
      • distribution of each group (salmon vs light blue) of \(Y\) vs \(X\) is effectively the same
    • lots of direct evidence for comparing two groups holding \(X\) fixed

13.5 Simulation 5 - Binary Interaction

  • the following code simulates the linear model, \[Y_i = \beta_0 + \beta_1 x_i + \beta_2 t_i + \beta_3 x_i t_i + \epsilon_i\] where \(t = \{0, 1\} \rightarrow\) binary variable
  • in this case, \(\beta_0\) and \(\beta_2\) are set to \(0\)
# simulate data
n <- 100; t <- rep(c(0, 1), c(n/2, n/2)); x <- c(runif(n/2, -1, 1), runif(n/2, -1, 1));
# define parameters/coefficients
beta0 <- 0; beta1 <- 2; beta2 <- 0; beta3 <- -4; sigma <- .2
# generate outcome using linear model
y <- beta0 + x * beta1 + t * beta2 + t * x * beta3 + rnorm(n, sd = sigma)
# set up axes
plot(x, y, type = "n", frame = FALSE)
# plot linear fit of y vs x
abline(lm(y ~ x), lwd = 2, col = "blue")
# plot means of the two groups (t = 0 vs t = 1)
abline(h = mean(y[1 : (n/2)]), lwd = 3, col = "red")
abline(h = mean(y[(n/2 + 1) : n]), lwd = 3, col = "red")
# plot linear fit of y vs x and t and interaction term
fit <- lm(y ~ x + t + I(x * t))
# plot the two lines corresponding to (t = 0 vs t = 1)
abline(coef(fit)[1], coef(fit)[2], lwd = 3)
abline(coef(fit)[1] + coef(fit)[3], coef(fit)[2] + coef(fit)[4], lwd = 3)
# add in the actual data points
points(x[1 : (n/2)], y[1 : (n/2)], pch = 21, col = "black", bg = "lightblue", cex = 1)
points(x[(n/2 + 1) : n], y[(n/2 + 1) : n], pch = 21, col = "black", bg = "salmon", cex = 1)

  • in the above graph, the elements are as follows:
    • \(red\) lines = means for two groups (\(t = 0\) vs \(t = 1\)) \(\rightarrow\) two lines representing lm(y ~ t)
    • \(black\) lines = regression lines for two groups (\(t = 0\) vs \(t = 1\)) \(\rightarrow\) two lines representing lm(y ~ t + x + t*x)
    • \(blue\) line = overall regression of \(y\) vs \(x\) \(\rightarrow\) line representing lm(y ~ x)
      • this is completely meaningless in this case
  • from the graph, we can see that
    • treatment effect does not apply since it varies with \(x\)
      • impact of treatment/group variable reverses itself depending on \(x\)
      • for example, comparing points with \(X\) around 0.0 versus \(X\) around 1.0: near 0 there is little difference between the two groups, while near 1 one group is clearly above the other
    • adjustment effect is effectively zero as the means of the two groups are very similar
    • both intercept and slope of the two lines depend on the group variable \(t\)
    • group variable and \(x\) are unrelated
    • lots of information for comparing group effects holding \(x\) fixed

13.6 Simulation 6 - Continuous Adjustment

  • the following code simulates the linear model, \[Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i\]
# simulate data
p <- 1; n <- 100; x2 <- runif(n); x1 <- p * runif(n) - (1 - p) * x2
# define parameters/coefficients
beta0 <- 0; beta1 <- 1; beta2 <- 4 ; sigma <- .01
# generate outcome using linear model
y <- beta0 + x1 * beta1 + beta2 * x2 + rnorm(n, sd = sigma)
# plot y vs x1 and x2
qplot(x1, y) + geom_point(aes(colour=x2)) + geom_smooth(method = lm)

  • in the above graph, we plotted \(y\) vs \(x_{1}\) with \(x_{2}\) denoted as the gradient of color from blue to white
  • to investigate the bivariate relationship more clearly, we can use the following command from the rgl package to generate 3D plots
rgl::plot3d(x1, x2, y)
  • \(y\) vs \(x_{1}\)

  • \(y\) vs \(x_{2}\)

  • \(x_{1}\) vs \(x_{2}\)

  • residual plot with effect of \(x_2\) removed from both \(y\) and \(x_1\)
# plot the residuals for y and x1 with x2 removed
plot(resid(lm(x1 ~ x2)), resid(lm(y ~ x2)), frame = FALSE,
col = "black", bg = "lightblue", pch = 21, cex = 1)
# add linear fit line
abline(lm(I(resid(lm(y ~ x2))) ~ I(resid(lm(x1 ~ x2)))), lwd = 2)

  • from the generated plots above, we can see that
    • \(x_{1}\) is unrelated to \(x_{2}\)
    • \(x_{2}\) is strongly correlated with \(y\)
    • adjusted relationship between \(x_{1}\) and \(y\) (loosely correlated \(\rightarrow\) \(R^2\) = 0.06) largely unchanged when \(x_{2}\) is considered
      • \(x_{2}\) captures the vast majority of variation in data
      • there exists almost no residual variability after removing \(x_{2}\)
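The residual-on-residual regression above is not ad hoc: by the Frisch–Waugh–Lovell theorem, its slope equals the \(x_1\) coefficient from the full multivariable fit. A self-contained sketch:

```r
set.seed(2)
# simulate data in the spirit of this section
n <- 100; x2 <- runif(n); x1 <- runif(n)
y <- x1 + 4 * x2 + rnorm(n, sd = .01)
# x1 coefficient from the full model
full <- coef(lm(y ~ x1 + x2))["x1"]
# slope from regressing x2-removed residuals on x2-removed residuals
part <- coef(lm(resid(lm(y ~ x2)) ~ resid(lm(x1 ~ x2))))[2]
c(full, part)   # identical
```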

13.7 Summary and Considerations

  • modeling multivariate relationships is difficult
    • modeling for prediction is fairly straightforward
    • interpreting the regression lines is much harder, as adjusting for variables can have profound effect on variables of interest
  • it is often recommended to explore with simulations to see how inclusion or exclusion of another variable affects the relationship of variable of interest and the outcome
  • variable selection affects the estimated associations between the outcome and predictors; using the model to formulate causal relationships is even more difficult (an entire field is dedicated to this \(\rightarrow\) causal inference)

14 Residuals and Diagnostics

# load swiss data
data(swiss)
# run linear regression on Fertility vs all other predictors
fit <- lm(Fertility ~ . , data = swiss)
# generate diagnostic plots in 2 x 2 panels
par(mfrow = c(2, 2)); plot(fit)

14.1 Outliers and Influential Points

  • outlier = an observation that is distant from the other observations of the data set
    • can be results of spurious or real processes
    • can conform to the regression relationship (i.e. being marginally outlying in X or Y, but not outlying given the regression relationship)
# generate data
n <- 100; x <- rnorm(n); y <- x + rnorm(n, sd = .3)
# set up axes
plot(c(-3, 6), c(-3, 6), type = "n", frame = FALSE, xlab = "X", ylab = "Y")
# plot regression line for y vs x
abline(lm(y ~ x), lwd = 2)
# plot actual (x, y) pairs
points(x, y, cex = 1, bg = "lightblue", col = "black", pch = 21)
# plot 4 points of interest
points(0, 0, cex = 1.5, bg = "darkorange", col = "black", pch = 21)
points(0, 5, cex = 1.5, bg = "darkorange", col = "black", pch = 21)
points(5, 5, cex = 1.5, bg = "darkorange", col = "black", pch = 21)
points(5, 0, cex = 1.5, bg = "darkorange", col = "black", pch = 21)

  • different outliers can have varying degrees of influence
    • influence = actual effect on model fit
    • leverage = potential for influence
  • in the plot above, we examine 4 different points of interest (in orange)
    • lower left: low leverage, low influence, not an outlier in any sense
    • upper left: low leverage, low influence, classified as outlier because it does not conform to the regression relationship
      • Note: though far from the rest of the data in \(Y\), this point has little effect on the slope because it lies in the middle of the \(X\) range; since the regression line must pass through the center of the observations, the point mainly shifts the intercept slightly
    • upper right: high leverage, low influence, classified as outlier because it lies far away from the rest of the data
      • Note: this point has low influence on regression line because it conforms to the overall regression relationship
    • lower right: high leverage, high influence, classified as outlier because it lies far away from the rest of the data AND it does not conform to the regression relationship
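
The leverage claims above can be checked numerically. A minimal sketch (freshly simulated data mirroring the plot, not the exact points above): append each of the four orange points to the data in turn and inspect its hat value. Leverage depends only on how far a point's x value sits from the center of the x's, so the two points at x = 5 show much higher leverage than the two at x = 0.

```r
# sketch: hat values for the four points of interest (simulated data)
set.seed(1)
n <- 100; x <- rnorm(n); y <- x + rnorm(n, sd = .3)
# append each point of interest and record its hat value in the refit model
lev <- sapply(list(c(0, 0), c(0, 5), c(5, 5), c(5, 0)), function(p){
    hatvalues(lm(c(y, p[2]) ~ c(x, p[1])))[n + 1]
})
round(lev, 3)  # first two (x = 0): low leverage; last two (x = 5): high leverage
```

Note that the y coordinate does not enter the hat value at all, which is exactly why leverage measures only the potential for influence.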

14.2 Influence Measures

  • there exist many pre-written functions for measuring the influence of observations in R's stats package
    • Note: typing ?influence.measures in R will display detailed documentation on all available influence measures
    • Note: the model argument referenced in the following functions is simply the linear model generated by the lm function (i.e. model <- lm(y ~ x))
    • rstandard(model) = standardized residuals \(\rightarrow\) residuals divided by their standard deviations
    • rstudent(model) = standardized residuals \(\rightarrow\) residuals divided by their standard deviations, where the \(i^{th}\) data point was deleted in the calculation of the standard deviation for the residual to follow a t distribution
    • hatvalues(model) = measures of leverage; “how far it is away”
    • dffits(model) = change in the predicted response when the \(i^{th}\) point is deleted in fitting the model
      • effectively measures influence of a point on prediction
    • dfbetas(model) = change in individual coefficients when the \(i^{th}\) point is deleted in fitting the model, standardized by a deleted estimate of the coefficient standard error
      • effectively measures influence of the individual coefficients
    • dfbeta(model) = difference in coefficients for including vs excluding each data point (unscaled version of dfbetas, ie. not standardized)
    • cooks.distance(model) = overall change in coefficients when the \(i^{th}\) point is deleted
    • resid(model) = returns ordinary residuals
    • resid(model)/(1-hatvalues(model)) = returns PRESS residuals (i.e. the leave one out cross validation residuals)
      • PRESS residuals measure the differences in the response and the predicted response at data point \(i\), where it was not included in the model fitting
      • effectively measures the prediction error based on model constructed with every other point but the one of interest
  • Note: the swirl lesson digs deeper into the manual calculations of dfbeta, hatvalues, rstandard, rstudent and cooks.distance
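
To make the PRESS identity above concrete, here is a minimal sketch (simulated data, not from the course): the ordinary residual divided by 1 minus the hat value reproduces the leave-one-out prediction error obtained by actually refitting the model without the point.

```r
# sketch: verify resid(fit)/(1 - hatvalues(fit)) equals the leave-one-out error
set.seed(2)
n <- 50; x <- rnorm(n); y <- x + rnorm(n)
fit <- lm(y ~ x)
press <- resid(fit) / (1 - hatvalues(fit))
# refit without point 1 and compute its prediction error by hand
fitno <- lm(y[-1] ~ x[-1])
loo.err <- y[1] - (coef(fitno)[1] + coef(fitno)[2] * x[1])
c(press = unname(press[1]), manual = unname(loo.err))
```

The two numbers agree, which is why PRESS residuals can be computed from a single fit without n refits.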

14.3 Using Influence Measures

  • The purpose of these functions is to probe the given data in different ways to diagnose different problems
    • patterns in residual plots (most important tool) \(\rightarrow\) generally indicate some poor aspect of model fit
      • heteroskedasticity \(\rightarrow\) non-constant variance
      • missing model terms
      • temporal patterns \(\rightarrow\) residuals versus collection order/index exhibit pattern
    • residual Q-Q plots plot theoretical quantiles vs actual quantiles of the residuals
      • investigates whether the errors are approximately normally distributed
    • leverage measures (hat values) measure the potential to influence the regression model
      • only depend on \(x\) or predictor variables
      • can be useful for diagnosing data entry errors
    • influence measures (i.e. dfbetas) measure the actual influence of points on a particular aspect of the regression model
      • evaluates how deleting or including a point impacts a particular aspect of the model
  • Be wary of simplistic rules for diagnostic plots and measures. The use of these tools is context specific. It's better to understand what they are trying to accomplish and use them judiciously (take a holistic view).
  • Not all measures have meaningful absolute scales, so it may be useful to compare these measures across observations in the same data set, but almost never across different data sets
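
As a small illustration of the Q-Q plot idea (a sketch with simulated data): when the errors are approximately normal, the sample quantiles of the standardized residuals track the theoretical normal quantiles closely.

```r
# sketch: Q-Q check of standardized residuals against normal quantiles
set.seed(3)
n <- 100; x <- rnorm(n); y <- 2 * x + rnorm(n)
fit <- lm(y ~ x)
# theoretical (x) vs sample (y) quantiles, without drawing
qq <- qqnorm(rstandard(fit), plot.it = FALSE)
# visual check: points should fall near the reference line
qqnorm(rstandard(fit)); qqline(rstandard(fit))
cor(qq$x, qq$y)  # close to 1 when the normality assumption is reasonable
```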

14.4 Example - Outlier Causing Linear Relationship

# generate random data and point (10, 10)
x <- c(10, rnorm(n)); y <- c(10, rnorm(n))
# plot y vs x
plot(x, y, frame = FALSE, cex = 1, pch = 21, bg = "lightblue", col = "black")
# perform linear regression
fit <- lm(y ~ x)
# add regression line to plot
abline(fit)

  • data generated
    • 100 points are randomly generated from the standard normal distribution
    • point (10, 10) added to the data set
  • there is no true regression relationship between X and Y, as the points are simply random noise
  • the apparent regression relationship exists precisely because of the presence of the point (10, 10)
    • \(R^2\) = 0.3212155
    • a single point has created a strong regression relationship where there shouldn’t be one
      • point (10, 10) has high leverage and high influence
    • we can use diagnostics to detect this kind of behavior
  • dfbetas(fit) = difference in coefficients for including vs excluding each data point
    • the function returns an n x m matrix, where n = number of observations in the original data set, and m = number of coefficients
    • for this example, the coefficients are \(\beta_0\) (intercept), and \(\beta_1\) (slope), and we are interested in the slope
# calculate the dfbetas for the slope for the first 10 points
round(dfbetas(fit)[1 : 10, 2], 3)
##      1      2      3      4      5      6      7      8      9     10 
##  6.859  0.135 -0.048 -0.049  0.064 -0.015 -0.022 -0.037  0.009 -0.058
  • as we can see from above, the slope coefficient would change dramatically if the first point (10, 10) is left out

  • hatvalues(fit) = measures the potential for influence for each point

# calculate the hat values for the first 10 points
round(hatvalues(fit)[1 : 10], 3)
##     1     2     3     4     5     6     7     8     9    10 
## 0.549 0.022 0.025 0.017 0.012 0.010 0.012 0.011 0.010 0.012
  • again, as we can see from above, the potential for influence is very large for the first point (10, 10)

Manual calculations of Influence Measures

  • dfbeta(fit) = difference in coefficients for including vs excluding each data point (unscaled)
# exclude outlier in the first element
fitno <- lm(y ~ x, data.frame(y = y[-1], x = x[-1]))
# manually calculate the difference in coefficients
df.diff <- coef(fit)-coef(fitno)
# using dfbeta function to do the same
df.beta <- dfbeta(fit)[1,]
rbind("manually" = df.diff, "dfbeta" = df.beta)
##          (Intercept)         x
## manually  0.03323954 0.4935411
## dfbeta    0.03323954 0.4935411
  • manually calculate hatvalue for the 1st element (outlier)
    • hat value = 1 minus the ratio of the residual from the full fit to the residual computed without the point (a measure of leverage, with a value between 0 and 1)
      • near 0 for points with little leverage
      • near 1 for points with high leverage
# calculate residual from model without outlier by plugging outlier into fitted model
resno <- y[1] - predict(fitno, data.frame(x = x[1]))
# 1 minus ratio of residual of fit inclusive outlier (numerator) to residual of fit exclusive outlier (denominator)
hat.manual <- 1-resid(fit)[1]/resno
# using hatvalues function to do the same
hat.values <- hatvalues(fit)[1]
rbind("manually" = hat.manual, "hatvalues" = hat.values)
##                   1
## manually  0.5486468
## hatvalues 0.5486468
  • Residuals of individual observations are sometimes treated as having the same variance, which is estimated as the sample variance of the entire set of residuals. Theoretically, however, residuals of individual observations have different variances, and these differences can become large in the presence of outliers
    • Standardized and Studentized residuals attempt to compensate for this effect in two slightly different ways. Both use hat values
  • rstandard(model) = standardized residuals \(\rightarrow\) residuals divided by their standard deviations
# Calculate standardized residuals
# First, calculate the sample standard deviation of fit's residual by dividing fit's deviance
# i.e., its residual sum of squares, by the residual degrees of freedom and taking the square root
sigma <- sqrt(deviance(fit)/df.residual(fit))
# estimate standard deviations of individual samples (=standardized residuals)
rstd <- resid(fit)/(sigma * sqrt(1-hatvalues(fit)))
# using rstandard function to do the same as above
r.standard <- rstandard(fit)
rbind("manually" = rstd[1], "rstandard" = r.standard[1])
##                  1
## manually  5.328772
## rstandard 5.328772
  • rstudent(model) = standardized residuals \(\rightarrow\) residuals divided by their standard deviations, where the \(i^{th}\) data point was deleted in the calculation of the standard deviation for the residual to follow a t distribution
# Calculate (externally) Studentized residuals (in this example we leave the outlier out)
# Recalling that the model we called fitno omits the outlier sample
# Calculate the sample standard deviation of fitno's residual by dividing its deviance, by its
# residual degrees of freedom and taking the square root
sigma1 <- sqrt(deviance(fitno)/df.residual(fitno))
# Calculate the Studentized residual for the outlier sample
rstdt <- resid(fit)/(sigma1 * sqrt(1-hatvalues(fit)))
# using rstudent function to do the same as above
r.student <- rstudent(fit)
rbind("manually" = rstdt[1], "rstudent" = r.student[1])
##                 1
## manually 6.278053
## rstudent 6.278053
  • cooks.distance(model) = overall change in coefficients when the \(i^{th}\) point is deleted
    • It is essentially the sum of squared differences between values fitted with and without a particular sample
    • It is normalized by (divided by) the residual sample variance times the number of coefficients (in our example 2: intercept and slope)
    • It essentially tells how much a given sample changes a model
# Calculate the difference in predicted values between fit and fitno,
# the models which respectively include and omit the outlier
dy <- predict(fitno, data.frame(x = x)) - predict(fit, data.frame(x = x))
# Calculate the outlier's Cook's distance
cook.manual <- sum(dy^2)/(2*sigma^2)
# using cooks.distance function to do the same as above
cook.dist <- cooks.distance(fit)[1]
rbind("manually" = cook.manual, "cook.distance" = cook.dist)
##                     1
## manually      17.2584
## cook.distance 17.2584

14.5 Example - Real Linear Relationship

# generate data
x <- rnorm(n); y <- x + rnorm(n, sd = .3)
# add an outlier that fits the relationship
x <- c(5, x); y <- c(5, y)
# plot the (x, y) pairs
plot(x, y, frame = FALSE, cex = 1, pch = 21, bg = "lightblue", col = "black")
# perform the linear regression
fit2 <- lm(y ~ x)
# add the regression line to the plot
abline(fit2)

  • data generated
    • 100 directly correlated points are generated
    • point (5, 5) added to the data set
  • there is a linear relationship between X and Y
    • \(R^2\) = 0.9450935
    • point (5, 5) has high leverage and low influence
# calculate the dfbetas for the slope for the first 10 points
round(dfbetas(fit2)[1 : 10, 2], 3)
##      1      2      3      4      5      6      7      8      9     10 
##  0.079  0.068  0.043 -0.069 -0.021 -0.021  0.030  0.007 -0.006 -0.028
# calculate the hat values for the first 10 points
round(hatvalues(fit2)[1 : 10], 3)
##     1     2     3     4     5     6     7     8     9    10 
## 0.175 0.035 0.013 0.015 0.013 0.014 0.020 0.019 0.013 0.025
  • as we can see from above, the point (5, 5) no longer has a large dfbetas value (indication of low influence) but still has a substantial hatvalue (indication of high leverage)
    • this is in line with our expectations

14.6 Example - Stefanski TAS 2007

  • taken from Leonard A. Stefanski’s paper Residual (Sur)Realism
  • data set can be found here
  • the data themselves exhibit no sign of correlation between the variables
# read data
data <- read.table('http://www4.stat.ncsu.edu/~stefanski/NSF_Supported/Hidden_Images/orly_owl_files/orly_owl_Lin_4p_5_flat.txt',
header = FALSE)
# construct pairwise plot
pairs(data)

# perform regression on V1 with all other predictors (omitting the intercept)
fit <- lm(V1 ~ . - 1, data = data)
# print the coefficients of the linear model
summary(fit)$coef
##     Estimate Std. Error   t value     Pr(>|t|)
## V2 0.9856157 0.12798121  7.701253 1.989126e-14
## V3 0.9714707 0.12663829  7.671225 2.500259e-14
## V4 0.8606368 0.11958267  7.197003 8.301184e-13
## V5 0.9266981 0.08328434 11.126919 4.778110e-28
  • as we can see from above, the p-values for the coefficients indicate that they are significant
  • if we take a look at the residual plot, an interesting pattern appears
# plot the residuals vs fitted values
plot(predict(fit), resid(fit), pch = '.')

15 Model Selection

“A model is a lens through which to look at your data” – Scott Zeger

“There’s no such thing as a correct model” – Brian Caffo

15.1 Rumsfeldian Triplet

“There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don’t know. But there are also unknown unknowns. There are things we don’t know we don’t know.” – Donald Rumsfeld

  • known knowns = regressors that we know and have, which will be evaluated to be included in the model
  • known unknowns = regressors that we don’t have but would like to include in the model
    • didn’t or couldn’t collect the data
  • unknown unknowns = regressors that we don’t know about that we should have included in the model

15.2 General Rules

  • omitting variables \(\rightarrow\) generally results in increased bias in the coefficients of interest
    • exceptions are when the omitted variables are uncorrelated with the regressors (variables of interest/included in model)
      • Note: this is why randomized treatments/trials/experiments are the norm; randomization is the best strategy for balancing confounders, i.e. maximizing the probability that the treatment variable is uncorrelated with variables not in the model
      • often times, due to experiment conditions or data availability, we cannot randomize
      • however, if there are too many unobserved confounding variables, even randomization won’t help
  • including irrelevant/unnecessary variables \(\rightarrow\) generally increases standard errors (estimated standard deviations) of the coefficients
    • Note: including any new variable increases the true standard errors of the other regressors, so it is not wise to idly add variables to the model
  • whenever highly correlated variables are included in the same model \(\rightarrow\) the standard errors, and therefore the variances, of the coefficient estimates increase \(\rightarrow\) this is known as variance inflation
    • the actual increase in the standard error of a coefficient from adding a regressor can be estimated by the ratio of the estimated standard errors minus 1, or in other words \[\Delta_{\sigma~|~adding~x_2} = \frac{\hat \sigma_{y \sim x_1+x_2}}{\hat \sigma_{y \sim x_1}} - 1\] for each coefficient in the regression model
      • Example: if standard error of the \(\beta_1\) of y~x1+x2 = 1.5 and standard error for the \(\beta_1\) of y~x1 = 0.5, then the actual increase in standard error of the \(\beta_1\) = 1.5/0.5 - 1 = 200%
    • Note: when the regressors added are orthogonal (uncorrelated) to the regressor of interest, there is no variance inflation
      • variance inflation factor (VIF) = the increase in the variance for the \(i^{th}\) regressor compared to the ideal setting where it is orthogonal to the other regressors
      • \(\sqrt{VIF}\) = increase in standard error
    • Note: variance inflation is only part of the picture in that sometimes we will include variables even though they dramatically inflate the variation because it is an empirical part of the relationship we are attempting to model
  • as the number of non-redundant variables approaches \(n\), the model approaches a perfect fit for the data
    • \(R^2\) increases monotonically as more regressors are included
    • Sum of Squared Errors (SSE) decreases monotonically as more regressors are included

15.3 Example - \(R^2\) v \(n\)

  • for the simulation below, no actual regression relationship exists, as the data generated are simply standard normal noise
  • it is clear, however, that as \(p\), the number of regressors included in the model, approaches \(n\), the \(R^2\) value approaches 1.0, which signifies perfect fit
# set number of measurements
n <- 100
# set up the axes of the plot
plot(c(1, n), 0 : 1, type = "n", xlab = "p", ylab = expression(R^2),
     main = expression(paste(R^2, " vs n")))
# for each value of p from 1 to n
r <- sapply(1 : n, function(p){
    # create outcome and p regressors
    y <- rnorm(n); x <- matrix(rnorm(n * p), n, p)
    # calculate the R^2
    summary(lm(y ~ x))$r.squared
})
# plot the R^2 values and connect them with a line
lines(1 : n, r, lwd = 2)

15.4 Adjusted \(R^2\)

  • recall that \(R^2\) is defined as the percent of total variability that is explained by the regression model, or \[R^2 = \frac{\mbox{regression variation}}{\mbox{total variation}} = 1- \frac{\sum_{i=1}^n (Y_i - \hat Y_i)^2}{\sum_{i=1}^n (Y_i - \bar Y)^2} = 1 - \frac{Var(e_i)}{Var(Y_i)}\]
  • Estimating \(R^2\) with the above definition is acceptable when there is a single variable, but it becomes less and less helpful as the number of variables increases
    • as we have shown previously, \(R^2\) always increases as more variables are introduced and is thus biased
  • adjusted \(R^2\) is a better estimate of variability explained by the model and is defined as \[R^2_{adj} = 1 - \frac{Var(e_i)}{Var(Y_i)} \times \frac{n-1}{n-k-1}\] where \(n\) = number of observations, and \(k\) = number of predictors in the model
    • since \(k\) is always greater than zero, the adjusted \(R^2\) is always smaller than the unadjusted \(R^2\)
    • adjusted \(R^2\) also penalizes adding large numbers of regressors, which would have inflated the unadjusted \(R^2\)
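
The formula above can be checked directly against R's built-in value. A minimal sketch using the swiss data (any data set would do):

```r
# sketch: adjusted R^2 computed from the definition vs summary()'s value
data(swiss)
fit <- lm(Fertility ~ Agriculture + Examination, data = swiss)
n <- nrow(swiss); k <- 2   # k = number of predictors (excluding intercept)
r2 <- summary(fit)$r.squared
r2.adj <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
c(manual = r2.adj, builtin = summary(fit)$adj.r.squared)
```

The two values agree, since \(1 - Var(e_i)/Var(Y_i)\) is just the unadjusted \(R^2\).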

15.5 Example - Unrelated Regressors

  • in the simulation below, outcome \(y\) is only related to \(x_1\)
    • \(x_2\) and \(x_3\) are random noise
  • we will run 1000 simulations of 3 linear regression models, and calculate the standard error of the slope
    • \(y\) vs \(x_1\)
    • \(y\) vs \(x_1 + x_2\)
    • \(y\) vs \(x_1 + x_2 + x_3\)
# simulate data
n <- 100; nosim <- 1000
# generate 3 random noise, unrelated variables
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n);
# calculate beta1s of three different regression
betas <- sapply(1 : nosim, function(i){
    # generate outcome as only related to x1
    y <- x1 + rnorm(n, sd = .3)
    # store beta1 of linear regression on y vs x1
    c(coef(lm(y ~ x1))[2],
      # store beta1 of linear regression on y vs x1 and x2
      coef(lm(y ~ x1 + x2))[2],
      # store beta1 of linear regression on y vs x1, x2 and x3
      coef(lm(y ~ x1 + x2 + x3))[2])
})
# calculate the standard error of the beta1s for the three regressions
beta1.se <- round(apply(betas, 1, sd), 5)
# print results
rbind("y ~ x1" = c("beta1SE" = beta1.se[1]),
"y ~ x1 + x2" = beta1.se[2],
"y ~ x1 + x2 + x3" = beta1.se[3])
##                  beta1SE.x1
## y ~ x1              0.02995
## y ~ x1 + x2         0.02997
## y ~ x1 + x2 + x3    0.02998
  • as we can see from the above results, including the unrelated regressors \(x_2\) and \(x_3\) leaves the standard error essentially unchanged; in other words, there is no variance inflation, because \(x_2\) and \(x_3\) are uncorrelated with the regressor of interest, \(x_1\)

15.6 Example - Highly Correlated Regressors / Variance Inflation

  • in the simulation below, outcome \(y\) is related to \(x_1\)
    • \(x_2\) and \(x_3\) are highly correlated with \(x_1\)
    • \(x_3\) is more correlated with \(x_1\) than \(x_2\)
  • we will run 1000 simulations of 3 linear regression models, and calculate the standard error of \(\beta_1\), the coefficient of \(x_1\)
    • \(y\) vs \(x_1\)
    • \(y\) vs \(x_1 + x_2\)
    • \(y\) vs \(x_1 + x_2 + x_3\)
# generate number of measurements and trials
n <- 100; nosim <- 1000
# generate random variables that are correlated with x1
x1 <- rnorm(n)
x2 <- x1/sqrt(2) + rnorm(n) /sqrt(2)
x3 <- x1 * 0.95 + rnorm(n) * sqrt(1 - 0.95^2)
# calculate the beta1s for 1000 trials
betas <- sapply(1 : nosim, function(i){
    # generate outcome as only related to x1
    y <- x1 + rnorm(n, sd = .3)
    # store beta1 of linear regression on y vs x1
    c(coef(lm(y ~ x1))[2],
      # store beta1 of linear regression on y vs x1 and x2
      coef(lm(y ~ x1 + x2))[2],
      # store beta1 of linear regression on y vs x1, x2 and x3
      coef(lm(y ~ x1 + x2 + x3))[2])
})
# calculate the standard error of the beta1 for the three regressions
beta1.se <- round(apply(betas, 1, sd), 5)
# print results
rbind("y ~ x1" = c("beta1SE" = beta1.se[1]),
"y ~ x1 + x2" = beta1.se[2],
"y ~ x1 + x2 + x3" = beta1.se[3])
##                  beta1SE.x1
## y ~ x1              0.03344
## y ~ x1 + x2         0.04701
## y ~ x1 + x2 + x3    0.09177
  • as we can see from above, adding highly correlated regressors drastically increases the standard errors of the coefficients

15.7 Variance inflation factors

  • Notice variance inflation was much worse when we included a variable that was highly related to x1
  • In these two simulations we had 1000 samples of estimated coefficients, hence could calculate sample variance in order to illustrate the effect. In a real case, we have only one set of coefficients and we depend on theoretical estimates. However, theoretical estimates contain an unknown constant of proportionality. We therefore depend on ratios of theoretical estimates called Variance Inflation Factors (VIF)
  • We don’t know \(\sigma\), so we can only estimate the increase in the actual standard error of the coefficients for including a regressor
  • However, \(\sigma\) drops out of the relative standard errors. If one sequentially adds variables, one can check the variance (or sd) inflation for including each one
  • When the other regressors are actually orthogonal to the regressor of interest, then there is no variance inflation
  • The variance inflation factor (VIF) is the increase in the variance of the \(i^{th}\) regressor's coefficient compared to the ideal setting where that regressor is orthogonal to the others
    • it is a ratio of estimated variances: the variance with the \(i^{th}\) regressor included, divided by the variance in the ideal case where it is uncorrelated with the other regressors
    • the square root of the VIF is the corresponding inflation in the standard error
  • Remember, variance inflation is only part of the picture. We want to include certain variables, even if they dramatically inflate our variance
  • Revisit example:
    • to estimate the actual change in variance, we can use the ratio of estimated variances for the \(\beta_1\) coefficient for the different models
      • summary(fit)$cov.unscaled = returns p x p covariance matrix for p coefficients, with the diagonal values as the true variances of coefficients
      • summary(fit)$cov.unscaled[2,2] = true variance for the \(\beta_1\)
# generate outcome that is correlated with x1
y <- x1 + rnorm(n, sd = .3)
# store the variance of beta1 for the 1st model
a <- summary(lm(y ~ x1))$cov.unscaled[2,2]
# calculate the ratio of variances of beta1 for 2nd and 3rd models with respect to 1st model
c(summary(lm(y ~ x1 + x2))$cov.unscaled[2,2],
summary(lm(y~ x1 + x2 + x3))$cov.unscaled[2,2]) / a - 1
## [1] 0.9186974 6.7619190
# alternatively, the change in variance can be estimated by calculating ratio of trials variance
temp <- apply(betas, 1, var); temp[2 : 3] / temp[1] - 1
##        x1        x1 
## 0.9769326 6.5328874
  • as we can see from the above results
    • adding \(x_2\) increases the variance of \(\beta_1\) by roughly 92% (nearly doubling it)
    • adding both \(x_2\) and \(x_3\) increases the variance roughly 7-fold (an increase of about 680%)
  • the estimated values from the 1000 trials differ from, but are close to, the true increases, and they will approach the true values as the number of trials increases

15.8 Example: Variance Inflation Factors

  • we will use the swiss data set for this example, and compare the following models
    • Fertility vs Agriculture
    • Fertility vs Agriculture + Examination
    • Fertility vs Agriculture + Examination + Education
# load swiss data
data(swiss)
# run linear regression for Fertility vs Agriculture
fit <- lm(Fertility ~ Agriculture, data = swiss)
# variance for coefficient of Agriculture
a <- summary(fit)$cov.unscaled[2,2]
# run linear regression for Fertility vs Agriculture + Examination
fit2 <- update(fit, Fertility ~ Agriculture + Examination)
# run linear regression for Fertility vs Agriculture + Examination + Education
fit3 <- update(fit, Fertility ~ Agriculture + Examination + Education)
# calculate ratios of variances for Agriculture coef for fit2 and fit3 w.r.t fit1
c(summary(fit2)$cov.unscaled[2,2], summary(fit3)$cov.unscaled[2,2]) / a - 1
## [1] 0.8915757 1.0891588
  • as we can see from above
    • adding Examination variable to the model increases the variance by 89%
    • adding Examination and Education variables to the model increases the variance by 109%
  • we can also calculate the variance inflation factors for all the predictors to see how much each coefficient's variance increases relative to the ideal setting in which the predictors are orthogonal/independent of each other
    • [car library] vif(fit) = returns the variance inflation factors for the predictors of the given linear model
# load car library
library(car)
# run linear regression on Fertility vs all other predictors
fit <- lm(Fertility ~ . , data = swiss)
# calculate the variance inflation factors
vif(fit)
##      Agriculture      Examination        Education         Catholic 
##         2.284129         3.675420         2.774943         1.937160 
## Infant.Mortality 
##         1.107542
# calculate the standard error inflation factors
sqrt(vif(fit))
##      Agriculture      Examination        Education         Catholic 
##         1.511334         1.917138         1.665816         1.391819 
## Infant.Mortality 
##         1.052398
  • as we can see from the above results, the variance inflation for the education is 2.774943
    • it means the variance in the estimated coefficient of Education is 2.774943 times what it might have been if Education were not correlated with the other regressors
  • Since Education and score on an Examination are likely to be correlated, we might guess that most of the variance inflation for Education is due to including Examination
  • Infant Mortality is, by and large, unrelated to the other variables, and so it does not have a large variance inflation factor
# excluding variable Examination
fit2 <- lm(Fertility ~ . -Examination, swiss)
# calculate the variance inflation factors
vif(fit2)
##      Agriculture        Education         Catholic Infant.Mortality 
##         2.147153         1.816361         1.299916         1.107528
  • As expected, omitting Examination has markedly decreased the VIF for Education, from 2.774943 to 1.816361. Note that omitting Examination has had almost no effect on the VIF for Infant Mortality. Chances are Examination and Infant Mortality are not strongly correlated

Summary VIF

  • VIF describes the increase in the variance of a coefficient due to the correlation of its regressor with the other regressors
  • VIF is the square of standard error inflation
  • A regressor that is strongly correlated with others will increase their VIFs, but that doesn't mean we should exclude it, because excluding it might bias the coefficient estimates of the regressors with which it is correlated
  • The problems of variance inflation and of bias due to excluded regressors both involve correlated regressors. There are methods, such as factor analysis or principal component analysis, which can convert the regressors to an equivalent uncorrelated set
  • However, using converted regressors may make interpretation difficult
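
One way to see where these numbers come from: for a linear model, the VIF of a regressor equals \(1/(1-R_j^2)\), where \(R_j^2\) is the \(R^2\) from regressing that regressor on all the others. A minimal sketch for Education in the swiss data, which reproduces the car::vif value quoted above:

```r
# sketch: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing the jth regressor on all the other regressors
data(swiss)
r2.ed <- summary(lm(Education ~ Agriculture + Examination + Catholic +
                        Infant.Mortality, data = swiss))$r.squared
vif.education <- 1 / (1 - r2.ed)
round(vif.education, 6)  # matches car::vif(fit)["Education"]
```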

15.9 Residual Variance Estimates

  • assuming that the model is linear with additive iid errors (with finite variance), we can mathematically describe the impact of omitting necessary variables or including unnecessary ones
    • underfitting the model (omit important variables) \(\rightarrow\) variance estimate is biased \(\rightarrow ~ E[\hat \sigma^2] \neq \sigma^2\)
      • biased because we've attributed to random variation things that are actually systematically explained by the omitted covariates
    • correctly fitting or overfitting the model \(\rightarrow\) variance estimate is unbiased \(\rightarrow ~ E[\hat \sigma^2] = \sigma^2\)
      • however, if unnecessary variables are included, the variance estimate is more variable than that of the correctly fitted model \(\rightarrow\) \(Var(\hat \sigma_{overfitted}) \geq Var(\hat \sigma_{correct})\)
      • in other words, adding unnecessary variables increases the variability of the estimate of the true residual variance
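
These claims can be illustrated with a small simulation (a sketch with made-up data): fit an underfit, a correctly specified, and an overfit model many times, and compare the average \(\hat \sigma^2\) to the true \(\sigma^2 = 1\).

```r
# sketch: average residual variance estimates under under-, correct-, and over-fitting
set.seed(4)
n <- 100; nosim <- 1000
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)   # x3 is irrelevant to y
s2 <- sapply(1 : nosim, function(i){
    y <- x1 + x2 + rnorm(n)                       # true sigma^2 = 1
    c(underfit = summary(lm(y ~ x1))$sigma^2,      # omits x2: biased upward
      correct  = summary(lm(y ~ x1 + x2))$sigma^2,
      overfit  = summary(lm(y ~ x1 + x2 + x3))$sigma^2)
})
rowMeans(s2)  # underfit well above 1; correct and overfit both near 1
```

The overfit model's average is also near 1 (unbiased), even though each individual overfit estimate is noisier than the correctly specified one.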

15.10 Example - Bias

  • The following example shows how omitting a correlated regressor can bias estimates of a coefficient
set.seed(8765)
temp <- rnorm(100)
# x1 and x2 are correlated
x1 <- (temp + rnorm(100))/sqrt(2)
x2 <- (temp + rnorm(100))/sqrt(2)
x3 <- rnorm(100)
# Function to simulate regression of y on 2 variables.
f <- function(k){
    # generate y (Note: the actual coefficient of x1 is 1)
    y <- x1 + x2 + x3 + .3*rnorm(100)
    # regression with 2 regressors
    c(lm(y ~ x1 + x2)$coef[2], # omit uncorrelated variable
      lm(y ~ x1 + x3)$coef[2]) # omit correlated variable
}
# beta1 values from the 2 regressions
x1c <- sapply(1:150, f)
# calculate the means of beta1 values
apply(x1c, 1, mean)
##       x1       x1 
## 1.034403 1.476944
# plot the histogram of beta1 values
p1 <- hist(x1c[1,], plot=FALSE)
p2 <- hist(x1c[2,], plot=FALSE)
yrange <- c(0, max(p1$counts, p2$counts))
plot(p1, col=rgb(0,0,1,1/4), xlim=range(x1c), ylim=yrange, xlab="Estimated coefficient of x1",
main="Bias Effect of Omitted Regressor")
plot(p2, col=rgb(1,0,0,1/4), xlim=range(x1c), ylim=yrange, add=TRUE)
legend(1.1, 40, c("Uncorrelated regressor, x3, omitted", "Correlated regressor, x2, omitted"),
fill=c(rgb(0,0,1,1/4), rgb(1,0,0,1/4)))

  • The first row of the matrix x1c contains independent estimates of x1’s coefficient in the case that x3, the regressor uncorrelated with x1, is omitted
  • The second row contains estimates of x1’s coefficient when the correlated regressor, x2, is omitted
  • Having been warned that omitting a correlated regressor would bias estimates of x1’s coefficient, we would expect the mean estimate of x1c’s second row (omitting correlated regressor) to be farther from 1 than the mean of x1c’s first row (omitting uncorrelated regressor). We do see this.
  • The bias is also evident in the histogram of x1’s coefficient. Estimates from the second row are clearly more than two standard deviations from the correct value of 1.

15.11 Covariate Model Selection

  • automated covariate/predictor selection is difficult
    • the space of models explodes quickly with interactions and polynomial terms
    • Note: in the Practical Machine Learning class, many modern methods for traversing large model spaces for the purposes of prediction will be covered
  • principal components analysis (PCA) or factor analytic models on covariates are often useful for reducing complex covariate spaces
    • find linear combinations of variables that captures the most variation
  • good experiment design can often eliminate the need for complex model searches during analyses
    • randomization and stratification can help simplify the final models
    • unfortunately, control over the design is often limited
  • it is also viable to manually explore the covariate space based on understanding of the data
    • use covariate adjustment and multiple models to probe that effect of adding a particular predictor on the model
    • Note: this isn’t a terribly systematic or efficient approach, but it tends to teach you a lot about the data through the process
  • if the models of interest are nested (i.e. one model is a special case of another with one or more coefficients set to zero) and do not differ by many parameters, it is straightforward to use nested likelihood ratio tests (ANOVA) to help find the best model
    • Analysis of Variance (ANOVA) works well when adding one or two regressors at a time
      • anova(fit1, fit2, fit3) = performs ANOVA or analysis of variance (or deviance) tables for a series of nested linear regression models
    • Note: it is extremely important to get the order of the models correct to ensure the results are sensible
    • an example can be found here
    • Also, as the number of regressors in the model \(p\) approaches \(n\), the model approaches a perfect fit for the data. In other words, adding even random regressors decreases the residual sum of squares, but we would be mistaken to believe that such decreases are significant. To assess significance, we should take into account that adding regressors also reduces the residual degrees of freedom. Analysis of variance (ANOVA) is a useful way to quantify the significance of additional regressors
  • another alternative for searching through different models is the step-wise search algorithm, which repeatedly adds/removes regressors one at a time to find the model with the lowest Akaike Information Criterion (AIC)
    • step(lm, k=df) = performs step-wise regression on a given linear model and returns the best linear model found
      • k=log(n) = specifying the value of k as the log of the number of observations forces the step-wise search to use the Bayesian Information Criterion (BIC) instead of the AIC
      • Note: both BIC and AIC penalize adding parameters to the regression model with an additional penalty term; the penalty is larger for BIC than AIC
    • MASS::stepAIC(lm, k = df) = a more versatile, rigorous implementation of step-wise regression
    • an example can be found here
  • My favorite approach is as follows: given a coefficient that I'm interested in, I like to use covariate adjustment and multiple models to probe that effect, evaluate it for robustness, and see which other covariates knock it out. This isn't a terribly systematic approach, but it tends to teach you a lot about the data as you get your hands dirty.
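As a sketch of the step-wise search described above, using the built-in swiss data set (the starting model and variable names are just for illustration):

```r
# sketch: AIC-based step-wise search on the built-in swiss data set
fitFull <- lm(Fertility ~ ., data = swiss)        # start from the full model
fitStep <- step(fitFull, trace = 0)               # default k = 2 gives AIC
# use k = log(nrow(swiss)) instead to penalize with BIC
fitBIC  <- step(fitFull, k = log(nrow(swiss)), trace = 0)
fitStep$call                                      # the selected model
```

Since step() only keeps moves that lower the criterion, the selected model's AIC can never exceed that of the starting model.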

15.12 Example: ANOVA

  • Nested model testing in R. In ANOVA the null hypothesis is that the added regressors are not significant
  • we will use the swiss data set for this example, and compare the following nested models
    1. Fertility vs Agriculture
    2. Fertility vs Agriculture + Examination + Education
    3. Fertility vs Agriculture + Examination + Education + Catholic + Infant.Mortality
# three different regressions that are nested
fit1 <- lm(Fertility ~ Agriculture, data = swiss)
fit3 <- update(fit1, Fertility ~ Agriculture + Examination + Education)
fit5 <- update(fit1, Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality)
# perform test using ANOVA
anova(fit1, fit3, fit5)
## Analysis of Variance Table
## 
## Model 1: Fertility ~ Agriculture
## Model 2: Fertility ~ Agriculture + Examination + Education
## Model 3: Fertility ~ Agriculture + Examination + Education + Catholic + 
##     Infant.Mortality
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     45 6283.1                                  
## 2     43 3180.9  2    3102.2 30.211 8.638e-09 ***
## 3     41 2105.0  2    1075.9 10.477 0.0002111 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# RSS's are the residual sums of squares, eg. for fit3
deviance(fit3)
## [1] 3180.925
  • the anova() function returns a formatted table with the following information
    • Res.Df = residual degrees of freedom for the models
    • RSS = residual sum of squares for the models, measure of fit
    • Df = change in degrees of freedom from one model to the next
    • Sum of Sq = difference/change in residual sum of squares from one model to the next
    • F = F statistic, measures the ratio of two scaled sums of squares divided by their respective degrees of freedom, reflecting different sources of variability \[F = \frac{\frac{RSS_1 - RSS_2}{p_2 - p_1}}{\frac{RSS_2}{n-p_2}}\] where \(p_1\) and \(p_2\) = number of parameters (= number of coefficients + 1 intercept, eg. for fit3 p = 4, for fit5 p = 6) in the two models for comparison, and \(n\) = number of observations
    • provided the residuals themselves are normally distributed, \(RSS_1 - RSS_2\) and \(RSS_2\) are (after scaling by \(\sigma^2\)) central chi-squared variables, since the residuals have mean zero
    • Pr(>F) = p-value for the F statistic to indicate whether the change in model is significant or not
    • Rejection is based on a right-tailed F test, Pr(>F), applied to an F value
  • from the above result, we can see that both going from first to second, and second to third models result in significant reductions in RSS and better model fits
  • Furthermore, both null hypothesis tests are rejected at the 0.001 level, so at least one of the added regressors is significant at each step (fit1 to fit3, as well as fit3 to fit5).
  • We can be confident that fit5 is significantly better than fit1 (both tests are significant), with one caveat: analysis of variance is sensitive to its assumption that model residuals are approximately normal. If they are not, we could get a small p-value for that reason alone. It is thus worth testing the residuals for normality. The Shapiro-Wilk test is quick and easy in R (normality is its null hypothesis)
shapiro.test(fit3$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit3$residuals
## W = 0.97276, p-value = 0.336
shapiro.test(fit5$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  fit5$residuals
## W = 0.98892, p-value = 0.9318
  • The Shapiro-Wilk test fails to reject the null hypothesis (normality) for either model, supporting confidence in our analysis of variance.
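The F statistics in the table above can be reproduced by hand. One detail worth knowing: R's anova() uses the residual mean square of the largest model as the denominator for every comparison (a sketch):

```r
# refit the three nested models on the swiss data
fit1 <- lm(Fertility ~ Agriculture, data = swiss)
fit3 <- update(fit1, . ~ . + Examination + Education)
fit5 <- update(fit3, . ~ . + Catholic + Infant.Mortality)
# denominator: residual mean square of the largest model
denom <- deviance(fit5) / df.residual(fit5)
# F for adding Examination + Education (2 df); matches the table (~30.21)
f23 <- ((deviance(fit1) - deviance(fit3)) / 2) / denom
# corresponding right-tailed p-value
p23 <- pf(f23, 2, df.residual(fit5), lower.tail = FALSE)
```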

16 General Linear Models Overview

16.1 Simple Linear Model

  • exponential family distribution: Gaussian distribution, assumed \(Y_i \sim N(\mu_i, \sigma^2)\)
  • linear predictor: \(\eta_i = \sum_{k=1}^p X_{ik} \beta_k\)
  • link function \(g\) so that \(g(\mu_i) = \eta_i\)
    • for linear models, \(g(\mu) = \mu\) so that \(\eta_i = \mu_i\)
  • result: the same likelihood model (see derivation) as the additive Gaussian error linear model \[Y_i = \sum_{k=1}^p X_{ik} \beta_k + \epsilon_{i}\] where \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)\)
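With the identity link, the Gaussian GLM reproduces ordinary least squares exactly; a quick check on the built-in mtcars data (used here only as an illustration):

```r
# a Gaussian GLM with identity link gives the same coefficients as lm()
fitLM  <- lm(mpg ~ wt, data = mtcars)
fitGLM <- glm(mpg ~ wt, data = mtcars, family = gaussian)
all.equal(coef(fitLM), coef(fitGLM))
```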

16.2 Logistic Regression

  • exponential family distribution: binomial/Bernoulli distribution assumed, that is \(Y_i \sim Bernoulli(\mu_i)\) where the probability of success is \(\mu_i\)
    • due to the properties of the binomial/Bernoulli distribution, \(E[Y_i] = \mu_i\) where \(0 \leq \mu_i \leq 1\)
  • linear predictor: \(\eta_i = \sum_{k=1}^p X_{ik} \beta_k\)
  • link function : \(g(\mu) = \eta = \log\left(\frac{\mu}{1 - \mu}\right)\)
    • odds for success for a binomial/Bernoulli distribution is defined as \[\mbox{odds} = \frac{p}{1-p}\]
    • logit is defined as \[\log(\mbox{odds}) = \log \frac{\mu}{1-\mu}\]
      • Note: the \(\log\) here is the natural log
      • Note: We’re transforming the mean of the distribution, and not the Ys themselves
    • inverse logit is defined as \[\mu_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)}\]
    • complement of inverse logit is \[1 - \mu_i = \frac{1}{1 + \exp(\eta_i)}\]
  • result: the likelihood model \[\begin{aligned} L(\beta) & = \prod_{i=1}^n \mu_i^{y_i} (1 - \mu_i)^{1-y_i} \\ (plug~in~\mu_i~and~1 - \mu_i~from~above)& = \prod_{i=1}^n \left(\frac{\exp(\eta_i)}{1 + \exp(\eta_i)}\right)^{y_i} \left(\frac{1}{1 + \exp(\eta_i)}\right)^{1-y_i}\\ (multiply~2^{nd}~term~by~\frac{exp(\eta_i)}{exp(\eta_i)})& = \prod_{i=1}^n \left(\frac{\exp(\eta_i)}{1 + \exp(\eta_i)}\right)^{y_i} \left(\frac{\exp(\eta_i)}{1 + \exp(\eta_i)}\right)^{1-y_i} \left(\frac{1}{\exp(\eta_i)}\right)^{1-y_i}\\ (simplify) & = \prod_{i=1}^n \left(\frac{\exp(\eta_i)}{1 + \exp(\eta_i)}\right) \left(\frac{1}{\exp(\eta_i)}\right)^{1-y_i}\\ (simplify) & = \prod_{i=1}^n \left(\frac{\exp(\eta_i)}{1 + \exp(\eta_i)}\right) \exp(\eta_i)^{y_i-1}\\ (simplify) & = \prod_{i=1}^n \frac{\exp(\eta_i)^{y_i}}{1 + \exp(\eta_i)} \\ (change~form~of~numerator) & = \exp\left(\sum_{i=1}^n y_i \eta_i \right)\prod_{i=1}^n \frac{1}{1 + \exp(\eta_i)}\\ (substitute~\eta_i) \Rightarrow L(\beta) & = \exp\left(\sum_{i=1}^n y_i \big(\sum_{k=1}^p X_{ik} \beta_k \big) \right)\prod_{i=1}^n \frac{1}{1 + \exp\big(\sum_{k=1}^p X_{ik} \beta_k \big)}\\ \end{aligned}\]
    • maximizing the likelihood \(L(\beta)\) (solving for \(\frac{\partial L}{\partial \beta} = 0)\) would return a set of optimized coefficients \(\beta\) that will fit the data
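The logit and inverse logit above correspond to R's built-in qlogis() and plogis(); a quick sanity check (the \(\eta\) values are arbitrary):

```r
# the inverse logit from the notes agrees with R's plogis()
eta <- c(-2, 0, 3)                 # some linear predictor values
mu  <- exp(eta) / (1 + exp(eta))   # inverse logit
all.equal(mu, plogis(eta))         # same function
all.equal(qlogis(mu), eta)         # the logit inverts it
```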

16.3 Poisson Regression

  • exponential family distribution: Poisson distribution assumed, that is \(Y_i \sim Poisson(\mu_i)\) so that \(E[Y_i] = \mu_i\) where \(0\leq \mu_i\)
  • linear predictor: \(\eta_i = \sum_{k=1}^p X_{ik} \beta_k\)
  • link function : \(g(\mu) = \eta = \log(\mu)\)
    • Note: the \(\log\) here is the natural log
    • since \(e^x\) is the inverse of \(\log(x)\), then \(\eta_i = \log(\mu_i)\) can be transformed into \(\mu_i = e^{\eta_i}\)
  • result: the likelihood model \[\begin{aligned} L(\beta) & = \prod_{i=1}^n (y_i !)^{-1} \mu_i^{y_i}e^{-\mu_i}\\ (substitute~\mu_i = e^{\eta_i}) & = \prod_{i=1}^n \frac{(e^{\eta_i})^{y_i}}{y_i! e^{e^{\eta_i}}}\\ (transform) &= \prod_{i=1}^n \frac{\exp(\eta_i y_i)}{y_i! \exp(e^{\eta_i})}\\ (taking~\log~of~both~sides)~\mathcal{L}(\beta) & = \sum_{i=1}^n \eta_i y_i - \sum_{i=1}^n e^{\eta_i} - \sum_{i=1}^n log(y_i!) \\ (since~y_i~is~given,~we~can~ignore~\log y_i!)~\mathcal{L}(\beta) & \propto \sum_{i=1}^n \eta_i y_i - \sum_{i=1}^n e^{\eta_i}\\ (substitute~\eta_i= \sum_{k=1}^p X_{ik} \beta_k) \Rightarrow \mathcal{L}(\beta) & \propto \sum_{i=1}^n y_i \left(\sum_{k=1}^p X_{ik}\beta_k\right) - \sum_{i=1}^n \exp \left(\sum_{k=1}^p X_{ik} \beta_k \right) \\ \end{aligned}\]
    • maximizing the log likelihood \(\mathcal{L}(\beta)\) (solving for \(\frac{\partial \mathcal{L}}{\partial \beta} = 0)\) would return a set of optimized coefficients \(\beta\) that will fit the data
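In practice the maximization is carried out by glm(). As a sketch, fitting simulated Poisson counts (the true coefficients below are made up for illustration) approximately recovers them:

```r
# simulate counts with log-mean 1 + 0.5 x, then fit a Poisson GLM
set.seed(1)
x <- rnorm(500)
y <- rpois(500, lambda = exp(1 + 0.5 * x))
fit <- glm(y ~ x, family = poisson)
coef(fit)   # close to the true values (1, 0.5)
```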

16.4 Variances and Quasi-Likelihoods

  • in each of the linear/Bernoulli/Poisson cases, the only term in the likelihood functions that depends on both the data and \(\beta\) is \[\sum_{i=1}^n y_i \eta_i =\sum_{i=1}^n y_i\sum_{k=1}^p X_{ik} \beta_k = \sum_{k=1}^p \beta_k\sum_{i=1}^n X_{ik} y_i\]
  • this means that we don't need all of the collected data to maximize the likelihoods/find the coefficients \(\beta\); we only need \(\sum_{i=1}^n X_{ik} y_i\)
    • Note: this simplification is a consequence of choosing “canonical” link functions, \(g(\mu)\), to be in specific forms
  • [Derivation needed] all models achieve their maximum at the root of the normal equations \[\sum_{i=1}^n \frac{(Y_i - \mu_i)}{Var(Y_i)}W_i = 0\] where \(W_i = \frac{\partial g^{-1}(\eta_i)}{\partial \eta_i}\), the derivative of the inverse of the link function
    • Note: this is similar to deriving the least square equation where the middle term must be set to 0 to find the solution (see Derivation for \(\beta\))
    • Note: since \(\mu_i = g^{-1}(\eta_i) =g^{-1}\left(\sum_{k=1}^p X_{ik} \beta_k\right)\), the normal equations are really functions of \(\beta\)
  • the variance, \(Var(Y_i)\), is defined as
    • linear model: \(Var(Y_i) = \sigma^2\), where \(\sigma\) is constant
    • binomial model: \(Var(Y_i) = \mu_i (1 - \mu_i)\)
    • Poisson model: \(Var(Y_i) = \mu_i\)
  • for binomial and Poisson models, there are strict relationships between the mean and variance that can be easily tested from the data:
    • binomial: mean = \(\mu_i\), variance = \(\mu_i (1 - \mu_i)\)
    • Poisson: mean = \(\mu_i\), variance = \(\mu_i\)
  • it is often useful to have a more flexible variance model (i.e. the data don't follow the binomial/Poisson distributions exactly but are only approximated by them; eg. under the Poisson assumption the observed variance may differ from the mean, though in theory both should equal \(\mu_i\)), even if it doesn't correspond to an actual likelihood, so we can add an extra parameter, \(\phi\), to the normal equations to form quasi-likelihood normal equations \[ \mbox{binomial}:~\sum_{i=1}^n \frac{(Y_i - \mu_i)}{\phi \mu_i (1 - \mu_i ) } W_i=0 \\ \mbox{Poisson}:~\sum_{i=1}^n \frac{(Y_i - \mu_i)}{\phi \mu_i} W_i=0 \] where \(W_i = \frac{\partial g^{-1}(\eta_i)}{\partial \eta_i}\), the derivative of the inverse of the link function
    • for the R function glm(), it's possible to specify that the model be solved using the quasi-likelihood normal equations instead of the normal equations through the parameters family = quasibinomial and family = quasipoisson, respectively
    • Note: the quasi-likelihood models generally have the same properties as standard GLMs
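A sketch of how the extra parameter \(\phi\) shows up in practice: fitting over-dispersed counts (negative binomial draws, used here only to create variance greater than the mean; the coefficients are made up) with family = quasipoisson yields a dispersion estimate well above 1:

```r
# simulate over-dispersed counts: variance = mu + mu^2/2 > mu
set.seed(2)
x <- rnorm(300)
y <- rnbinom(300, mu = exp(1 + 0.3 * x), size = 2)
fitQ <- glm(y ~ x, family = quasipoisson)
summary(fitQ)$dispersion   # estimate of phi, well above 1
```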

16.5 Solving for Normal and Quasi-Likelihood Normal Equations

  • normal equations have to be solved iteratively
    • the results are \(\hat \beta_k\), estimated coefficients for the predictors
    • for quasi-likelihood normal equations, \(\hat \phi\) will be part of the results as well
    • in R, glm() solves the equations with iteratively reweighted least squares (a Newton-Raphson/Fisher scoring type algorithm)
    • Note: many of the ideas, interpretation, and conclusions derived from simple linear models are applicable to GLMs
  • predicted linear predictor responses are defined as \[\hat \eta = \sum_{k=1}^p X_k \hat \beta_k\]
  • predicted mean responses can be solved from \[\hat \mu = g^{-1}(\hat \eta)\]
  • coefficients are interpreted as the expected change in the link function of the expected response per unit change in \(X_k\) holding other regressors constant, or \[\beta_k = g(E[Y | X_k = x_k + 1, X_{\sim k} = x_{\sim k}]) - g(E[Y | X_k = x_k, X_{\sim k}=x_{\sim k}])\]
  • Note: responses and coefficients are on different scales (because the mean was transformed by the link function). In particular, coefficients are interpreted on the scale of the linear predictor: in the binomial case on the scale of the logit, and in the Poisson case on the scale of the log mean.
  • asymptotics are used for inference on the results (see the Statistical Inference course); these results rely on large-sample approximations, so in a GLM setting with a very small sample size, quantities like p-values may not really be applicable
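The two prediction scales above can be seen with predict(); a minimal sketch on the built-in mtcars data (a logistic model of transmission type on mpg, chosen only for illustration):

```r
# eta-hat on the linear-predictor scale vs mu-hat on the response scale
fit    <- glm(am ~ mpg, data = mtcars, family = binomial)
etaHat <- predict(fit, type = "link")       # scale of the linear predictor
muHat  <- predict(fit, type = "response")   # probability scale
# mu-hat = g^{-1}(eta-hat), here the inverse logit
all.equal(muHat, exp(etaHat) / (1 + exp(etaHat)))
```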

17 General Linear Models - Binary Models

17.1 Odds

  • odds are useful in constructing logistic regression models and fairly easy to interpret
    • imagine flipping a coin with success probability \(p\)
      • if heads, you win \(X\)
      • if tails, you lose \(Y\)
    • how should \(X\) and \(Y\) be set so that the game is fair? \[E[earnings]= X p - Y (1 - p) = 0 \Rightarrow \frac{Y}{X} = \frac{p}{1 - p}\]
    • odds can be interpreted as “How much should you be willing to pay for a \(p\) probability of winning a dollar?”
      • if \(p > 0.5\), you have to pay more if you lose than you get if you win
      • if \(p < 0.5\), you have to pay less if you lose than you get if you win
  • odds are NOT probabilities
  • odds of 1 = even odds (50%/50%); an odds ratio of 1 = no difference in odds
    • \(p = 0.5 \Rightarrow odds = \frac{0.5}{1-0.5} = 1\)
  • log odds of 0 = even odds; a log odds ratio of 0 = no difference in odds
    • \(p = 0.5 \Rightarrow \log(odds) = \log\left(\frac{0.5}{1-0.5}\right) = \log(1) = 0\)
  • an odds ratio < 0.5 or > 2 is commonly considered a “moderate effect”
  • relative risk = ratios of probabilities instead of odds, and are often easier to interpret but harder to estimate \[\frac{Pr(W_i | S_i = 10)}{Pr(W_i | S_i = 0)}\]
    • Note: relative risks often have boundary problems, as the range of \(\log(p)\) is \((-\infty,~0]\) whereas the range of the logit \(\log\frac{p}{1-p}\) is \((-\infty,\infty)\)
    • for small probabilities, Relative Risk \(\approx\) Odds Ratio, but they are not the same!
  • More on Odds Ratio
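A small numeric sketch of these definitions (the two probabilities are made up for illustration), including the small-probability approximation of the relative risk by the odds ratio:

```r
# odds, odds ratio, and relative risk for two small probabilities
odds <- function(p) p / (1 - p)
p1 <- 0.01; p2 <- 0.02
OR <- odds(p2) / odds(p1)   # odds ratio, ~2.02
RR <- p2 / p1               # relative risk, exactly 2
c(OR = OR, RR = RR)         # nearly equal because p1, p2 are small
```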

17.2 Example - Baltimore Ravens Win vs Loss

  • the data for this example can be found here
    • the data contains the records of 20 games for the Baltimore Ravens, a professional American Football team
    • there are 4 columns
      • ravenWinNum = 1 for Raven win, 0 for Raven loss
      • ravenWin = W for Raven win, L for Raven loss
      • ravenScore = score of the Raven team during the match
      • opponentScore = score of the opposing team during the match
# load the data
load("./data/ravensData.rda")
ravensData
##    ravenWinNum ravenWin ravenScore opponentScore
## 1            1        W         24             9
## 2            1        W         38            35
## 3            1        W         28            13
## 4            1        W         34            31
## 5            1        W         44            13
## 6            0        L         23            24
## 7            1        W         31            30
## 8            1        W         23            16
## 9            1        W          9             6
## 10           1        W         31            29
## 11           0        L         13            43
## 12           1        W         25            15
## 13           1        W         55            20
## 14           1        W         13            10
## 15           1        W         16            13
## 16           0        L         20            23
## 17           0        L         28            31
## 18           0        L         17            34
## 19           1        W         33            14
## 20           0        L         17            23
# boxplot of wins and losses
boxplot(ravenScore ~ ravenWin, ravensData, col=0x96, lwd=3, horizontal=TRUE,
col.lab="purple", col.main="purple", xlab="Ravens' Score",
main="Ravens' Wins and Losses vs Score")
abline(v=23, lwd=5, col="purple")

  • We see from the boxplot that in fact, about 3/4 of their losses are at or below a score of 23 (purple vertical line) and about 3/4 of their wins are at or above it. (Remember that the purple boxes represent 50% of the samples, and the “T’s” 25%.)
  • There were 9 games in which the Ravens scored 23 points or less. They won 4 of these games, so we might guess their probability of winning, given that they score 23 points or less, is about 1/2.
  • There were 11 games in which the Ravens scored 24 points or more. They won all but one of these.
  • We see a fairly rapid transition in the Ravens’ win/loss record between 23 and 28 points. At 23 points and below they win about half their games, between 24 and 28 points they win 3 of 4, and above 28 points they win them all. From this, we get a very crude idea of the correspondence between points scored and the probability of a win. We get an S shaped curve.
# plot probabilities from the data
plot(c(3,23,29,55), c(0.5, 0.5, 1.0, 1.0), type='l', lwd=5, col="purple", col.lab="purple", ylim=c(0.25,1),
xlab="Ravens' Score", ylab="Probability of a Ravens win", col.main="purple",
main="Crude estimate of Ravens' win probability\nvs the points which they score.")

  • Of course, we would expect a real curve to be smoother. We would not, for instance, expect the Ravens to win half the games in which they scored zero points, nor to win all the games in which they scored more than 28. A generalized linear model which has these properties supposes that the log odds of a win depend linearly on the score. That is, \(\log(p/(1-p)) = \beta_0 + \beta_1 \times \text{score}\). The link function, \(\log(p/(1-p))\), is called the logit, and the process of finding the best \(\beta_0\) and \(\beta_1\) is called logistic regression. The “best” \(\beta_0\) and \(\beta_1\) are those which maximize the likelihood of the actual win/loss record. Based on the score of a game, \(\beta_0\) and \(\beta_1\) give us a log odds, which we can convert to a probability, \(p\), of a win. We would like \(p\) to be high for the scores of winning games, and low for the scores of losses. Before we discuss logistic regression, we see how simple linear regression would perform.

17.3 Example - Simple Linear Regression

  • simple linear regression can be used model win vs loss for the Ravens \[ W_i = \beta_0 + \beta_1 S_i + \epsilon_i \]
    • \(W_i\) = binary outcome, 1 if a Ravens win, 0 if not
    • \(S_i\) = number of points Ravens scored
    • \(\beta_0\) = probability of a Ravens win if they score 0 points
    • \(\beta_1\) = increase in probability of a Ravens win for each additional point
    • \(\epsilon_i\) = residual variation, error
  • the expected value for the model is defined as \[E[W_i | S_i, \beta_0, \beta_1] = \beta_0 + \beta_1 S_i\]
  • however, the model wouldn’t work well as the predicted results won’t be 0 vs 1
    • the error term, \(\epsilon_i\), is assumed to be continuous and normally distributed, meaning that the prediction will likely be a decimal
    • therefore, this is not a good assumption for the model
# perform linear regression
summary(lm(ravenWinNum ~ ravenScore, data = ravensData))
## 
## Call:
## lm(formula = ravenWinNum ~ ravenScore, data = ravensData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7302 -0.5076  0.1824  0.3215  0.5719 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 0.285032   0.256643   1.111   0.2814  
## ravenScore  0.015899   0.009059   1.755   0.0963 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4464 on 18 degrees of freedom
## Multiple R-squared:  0.1461, Adjusted R-squared:  0.09868 
## F-statistic:  3.08 on 1 and 18 DF,  p-value: 0.09625
  • as expected, the model produces a poor fit for the data (\(R^2_{adj} =\) 0.0987)
  • adding a threshold to the predicted outcome (i.e. if \(\hat W_i < 0.5, \hat W_i = 0\)) and using the model to predict the results would be viable
    • however, the coefficients for the model are not very interpretable

17.4 Example - Logistic Regression

  • Binary Outcome (0/1) \(W_i\)
  • Model the probability of Ravens win, defined as \[Pr(W_i | S_i, \beta_0, \beta_1)\]
  • odds is defined as \[\frac{Pr(W_i | S_i, \beta_0, \beta_1 )}{1-Pr(W_i | S_i, \beta_0, \beta_1)}\] which can take values from 0 to \(\infty\)
  • log odds or logit is defined as \[\log\left(\frac{Pr(W_i | S_i, \beta_0, \beta_1 )}{1-Pr(W_i | S_i, \beta_0, \beta_1)}\right)\] which can take values from \(-\infty\) to \(\infty\)
  • we can use the link function and linear predictors to construct the logistic regression model \[\begin{aligned} g(\mu_i) & = \log \left(\frac{\mu_i}{1 - \mu_i} \right) = \eta_i\\ (substitute~\mu_i = Pr(W_i | S_i, \beta_0, \beta_1))~g(\mu_i) & = \log\left(\frac{Pr(W_i | S_i, \beta_0, \beta_1)}{1-Pr(W_i | S_i, \beta_0, \beta_1)}\right) = \eta_i \\ (substitute~\eta_i=\beta_0 + \beta_1 S_i) \Rightarrow ~g(\mu_i) & = \log\left(\frac{Pr(W_i | S_i, \beta_0, \beta_1)}{1-Pr(W_i | S_i, \beta_0, \beta_1)}\right) = \beta_0 + \beta_1 S_i\\ \end{aligned} \] which can also be written as \[Pr(W_i | S_i, \beta_0, \beta_1 ) = \frac{\exp(\beta_0 + \beta_1 S_i)}{1 + \exp(\beta_0 + \beta_1 S_i)}\]
  • for the model \[\log\left(\frac{Pr(W_i | S_i, \beta_0, \beta_1)}{1-Pr(W_i | S_i, \beta_0, \beta_1)}\right) = \beta_0 + \beta_1 S_i\]
    • \(\beta_0\) = log odds of a Ravens win if they score zero points
    • \(\beta_1\) = log odds ratio of win probability for each point scored (compared to zero points) \[\beta_1 = \log\left(odds(S_i = S_i+1)\right) - \log\left(odds(S_i = S_i)\right) = \log\left(\frac{odds(S_i = S_i+1)}{odds(S_i = S_i)} \right)\]
    • \(\exp(\beta_1)\) = odds ratio of win probability for each point scored (compared to zero points) \[\exp(\beta_1) = \frac{odds(S_i = S_i+1)}{odds(S_i = S_i)}\]
  • we can leverage the manipulate function, vary \(\beta_0\) and \(\beta_1\) to fit logistic regression curves for simulated data
# set x values for the points to be plotted
x <- seq(-10, 10, length = 1000)
# "library(manipulate)" is needed to use the manipulate function
manipulate(
# plot the logistic regression curve
plot(x, exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x)),
type = "l", lwd = 3, frame = FALSE),
# slider for beta1
beta1 = slider(-2, 2, step = .1, initial = 2),
# slider for beta0
beta0 = slider(-2, 2, step = .1, initial = 0)
)

  • we can use glm(outcome ~ predictor, family = "binomial") to fit a logistic regression to the data
# run logistic regression on data
logRegRavens <- glm(ravenWinNum ~ ravenScore, data = ravensData, family="binomial")
# print summary
summary(logRegRavens)
## 
## Call:
## glm(formula = ravenWinNum ~ ravenScore, family = "binomial", 
##     data = ravensData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7575  -1.0999   0.5305   0.8060   1.4947  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.68001    1.55412  -1.081     0.28
## ravenScore   0.10658    0.06674   1.597     0.11
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24.435  on 19  degrees of freedom
## Residual deviance: 20.895  on 18  degrees of freedom
## AIC: 24.895
## 
## Number of Fisher Scoring iterations: 5
  • as we can see above, the coefficients \(\beta_0\) and \(\beta_1\) are -1.68, 0.107, which are interpreted to be the log odds ratios
  • we can exponentiate the coefficients and the confidence interval endpoints to convert them from the log odds scale back to the odds scale (in the same units as the data)
# take e^coefs to find the ratios
exp(logRegRavens$coeff)
## (Intercept)  ravenScore 
##   0.1863724   1.1124694
# take e^log confidence interval to find the confidence intervals
exp(confint(logRegRavens))
##                   2.5 %   97.5 %
## (Intercept) 0.005674966 3.106384
## ravenScore  0.996229662 1.303304
  • Note: \(\exp(x) \approx 1 + x\) for small values of \(x\) (close to 0); this can be a quick way to estimate the exponentiated coefficients
  • we can interpret the slope, \(\beta_1\), as an 11.247% increase in the odds (careful: not the probability) of winning for every additional point scored
  • we can interpret the intercept, \(\beta_0\): 0.186 is the estimated odds of a Ravens win if they score 0 points
    • Note: similar to the intercept of a simple linear regression model, the intercept should be interpreted carefully as it is an extrapolated value from the model and may not hold practical meaning
  • For coefficients on the logit scale, e.g. \(\beta_1\) for the score variable, we want to see whether or not the coefficient is close to zero; on the exponentiated scale, whether or not it is close to one. The standard error, Z value, and p-value in the summary refer to the logit scale.
    • From the confidence interval for \(\exp(\beta_1)\) we see the lower bound is below one (equivalently, the log-scale lower bound is below zero, which would suggest a decrease in the odds of winning per point scored), so the coefficient turns out not to be significant (due to a large standard error).
  • to calculate specific probability of winning for a given number of points \[Pr(W_i | S_i, \hat \beta_0, \hat \beta_1) = \frac{\exp(\hat \beta_0 + \hat \beta_1 S_i)}{1 + \exp(\hat \beta_0 + \hat \beta_1 S_i)}\]
  • the resulting logistic regression curve can be seen below, that is, the predictive responses put back on the probability scale
# plot probabilities from the data
plot(c(3,23,29,55), c(0.5, 0.5, 1.0, 1.0), type='l', lwd=5, col="purple", col.lab="purple", ylim=c(0.25,1),
xlab="Ravens' Score", ylab="Probability of a Ravens win", col.main="purple",
main="Ravens' win vs score probabilities: GLM maximum likelihood estimates\ncompared to crude estimates")
# plot the logistic regression
points(ravensData$ravenScore,logRegRavens$fitted,pch=19,col="blue",xlab="Score",ylab="Prob Ravens Win")
# add legend
legend('bottomright', c("Crude estimates", "GLM maximum likelihood estimates"), lwd=5, lty=1,
col=c("purple", "blue"))

  • The probabilities estimated by logistic regression using glm() are represented by the blue dots. It is more reasonable than our crude estimate in several respects: It increases smoothly with score, it estimates that 15 points give the Ravens a 50% chance of winning, that 28 points give them an 80% chance, and that 55 points make a win very likely (98%) but not absolutely certain.

  • Since we don’t have data below a score of 9, we can use R’s predict() function to see the model’s estimates for lower scores. Since this gives us log odds, we will have to convert to probabilities.

# calculate log odds
logodds<-predict(logRegRavens, data.frame(ravenScore=c(0,3,6)))
# convert to probabilities
exp(logodds)/(1+exp(logodds))
##         1         2         3 
## 0.1570943 0.2041977 0.2610505

17.5 Example - ANOVA for Logistic Regression

  • ANOVA can be performed on a single logistic regression, in which case it analyzes the change in deviance as each parameter is added to the model, or on multiple nested logistic regressions (similar to linear models)
# perform analysis of variance
anova(logRegRavens,test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: ravenWinNum
## 
## Terms added sequentially (first to last)
## 
## 
##            Df Deviance Resid. Df Resid. Dev Pr(>Chi)  
## NULL                          19     24.435           
## ravenScore  1   3.5398        18     20.895  0.05991 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • ANOVA returns information about the model, link function, response, as well as analysis of variance for adding terms
    • Df = change in degrees of freedom
      • the value 1 refers to adding the ravenScore parameter (slope)
    • Deviance = measure of the goodness of fit of the model compared to the previous model
    • Resid. Dev = residual deviance for current model
    • Pr(>Chi) = used to evaluate the significance of the added parameter
  • in this case, the Deviance value of 3.54 is actually the difference between the deviance of our model, which includes a slope, and that of a model which includes only an intercept.
  • the corresponding p-value from the Chi Squared distribution is 0.06
    • The null hypothesis is that the coefficient of ravenScore is zero. Given the p-value we fail to reject the null. In other words, the score variable adds very little to a model which just guesses.
    • Note: the Chi Squared distribution with 1 degree of freedom (2 parameters minus 1 parameter, or equivalently 19 − 18 residual degrees of freedom) is simply the square of a standard normal. A z statistic of 2 roughly corresponds to a two-sided 5% normal p-value, so a deviance of \(2^2 = 4\) corresponds to a Chi Squared p-value of approximately 5%, consistent with our result (deviance 3.54, p ≈ 0.06)
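The p-value in the table above can be reproduced from the deviance drop and the chi-squared distribution (a quick check):

```r
# right tail of chi-squared(1) at the deviance drop of 3.5398
p <- pchisq(3.5398, df = 1, lower.tail = FALSE)   # ~0.0599, matching Pr(>Chi)
# equivalently, the square root of the deviance behaves like a z statistic
p2 <- 2 * pnorm(sqrt(3.5398), lower.tail = FALSE) # same value
```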

17.6 Further resources

18 General Linear Models - Poisson Models

18.1 Properties of Poisson Distribution

  • a set of data \(X\) is said to follow the Poisson distribution, or \(X \sim Poisson(t\lambda)\), if \[P(X = x) = \frac{(t\lambda)^x e^{-t\lambda}}{x!}\] where \(x = 0, 1, \ldots\)
    • \(\lambda\) = rate or expected count per unit time
    • \(t\) = monitoring time
  • mean of Poisson distribution is \(E[X] = t\lambda\), thus \(E[X / t] = \lambda\)
  • variance of the Poisson is \(Var(X) = t\lambda\), equal to its mean
    • Note: Poisson approaches a Gaussian/normal distribution as \(t\lambda\) gets large
  • below are the Poisson distributions for various values of \(\lambda\)
# set up 1x3 panel plot
par(mfrow = c(1, 3))
# Poisson distribution for t = 1, and lambda = 2
plot(0 : 10, dpois(0 : 10, lambda = 2), type = "h", frame = FALSE)
# Poisson distribution for t = 1, and lambda = 10
plot(0 : 20, dpois(0 : 20, lambda = 10), type = "h", frame = FALSE)
# Poisson distribution for t = 1, and lambda = 100
plot(0 : 200, dpois(0 : 200, lambda = 100), type = "h", frame = FALSE)

  • as we can see from above, for large values of \(\lambda\), the distribution looks like the Gaussian
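The mean-variance relationship above can also be checked quickly by simulation (the rate is chosen arbitrarily):

```r
# for Poisson data the sample mean and variance should both be near t*lambda
set.seed(3)
x <- rpois(10000, lambda = 5)
c(mean = mean(x), var = var(x))   # both near 5
```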

18.2 Example - Leek Group Website Traffic

  • for this example, we will be modeling the daily traffic to Jeff Leek’s web site: http://biostat.jhsph.edu/~jleek/
  • for the purpose of the example, the time is always one day, so \(t = 1\) and the Poisson mean is interpreted as web hits per day
    • if \(t = 24\), we would be modeling web hits per hour
  • the data can be found here
# load data
load("./data/gaData.rda")
# convert the dates to proper formats
gaData$julian <- julian(gaData$date)
# look at the data
head(gaData)
##         date visits simplystats julian
## 1 2011-01-01      0           0  14975
## 2 2011-01-02      0           0  14976
## 3 2011-01-03      0           0  14977
## 4 2011-01-04      0           0  14978
## 5 2011-01-05      0           0  14979
## 6 2011-01-06      0           0  14980
# plot visits vs dates
plot(gaData$date,gaData$visits,pch=19,col="darkgrey",xlab="Date",ylab="Visits")

18.3 Example - Linear Regression

  • the traffic can be modeled using linear model as follows \[ NH_i = \beta_0 + \beta_1 D_i + \epsilon_i \]
    • \(NH_i\) = number of hits to the website
    • \(D_i\) = date
    • \(\beta_0\) = number of hits on day 0 (1970-01-01)
    • \(\beta_1\) = increase in number of hits per unit day
    • \(\epsilon_i\) = variation due to everything we didn’t measure
  • the expected outcome is defined as \[ E[NH_i | D_i, \beta_0, \beta_1] = \beta_0 + \beta_1 D_i\]
# plot the visits vs dates
plot(gaData$date,gaData$visits,pch=19,col="darkgrey",xlab="Date",ylab="Visits")
# perform linear regression
lm1 <- lm(gaData$visits ~ gaData$date)
# plot regression line
abline(lm1, col="red", lwd=3)

  • there is clearly some curvature in the data, so adding a squared term would be a reasonable next step; as a first approach, though, the linear fit is not bad, even if the counts are fairly small
  • the interpretation of a linear model is not ideal here: counts such as web hits are more naturally thought of on a relative (multiplicative) scale, while the linear model treats them on an additive scale; the next sections show ways to tweak the model to obtain relative interpretations, starting with taking the log of the outcome

18.4 Example - log Outcome

  • if we are interested in relative increases in web traffic, we can take the natural log of the outcome, so the linear model becomes \[ \log(NH_i) = \beta_0 + \beta_1 D_i + \epsilon_i\]
    • \(\log(NH_i)\) = log of the number of hits to the website
    • \(D_i\) = date
    • \(\beta_0\) = log number of hits on day 0 (1970-01-01)
    • \(\beta_1\) = increase in log number of hits per unit day
    • \(\epsilon_i\) = variation due to everything we didn’t measure
  • when we take the natural log of outcomes and fit a regression model, the exponentiated coefficients (\(e^{\beta_k}\)) estimate quantities based on the geometric means rather than the measured values
    • \(e^{E[\log(Y)]}\) = geometric mean of \(Y\)
      • the geometric mean is just exponentiating the arithmetic mean of the log data
      • geometric means are defined as \[e^{\frac{1}{n}\sum_{i=1}^n \log(y_i)} = (\prod_{i=1}^n y_i)^{1/n}\] which is the estimate for the population geometric mean
      • as we collect an infinite amount of data, \((\prod_{i=1}^n y_i)^{1/n} \to e^{E[\log(Y)]}\) (the sample geometric mean converges to the population geometric mean)
    • exponentiated coefficients are interpretable with respect to geometric means
      • \(e^{\beta_0}\) = estimated geometric mean hits on day 0
      • \(e^{\beta_1}\) = estimated relative increase or decrease in geometric mean hits per day
      • the intercept here is not very meaningful, since January 1, 1970 (day 0) is not a date we care about in terms of web hits; to make it more interpretable, we could subtract the earliest observed date from all dates, so that the exponentiated intercept becomes the estimated geometric mean hits on the first day of the data set
    • Note: we cannot take the natural log of zero counts, so often we need to add a constant (e.g., 1) to construct the log model
      • adding the constant changes the interpretation of coefficient slightly
      • \(e^{\beta_1}\) is now the relative increase or decrease in geometric mean hits + 1 per day
  • the expected outcome is \[E[\log(NH_i) | D_i, \beta_0, \beta_1] = \beta_0 + \beta_1 D_i \]
round(exp(coef(lm(I(log(gaData$visits + 1)) ~ gaData$date))), 5)
## (Intercept) gaData$date 
##     0.00000     1.00231
  • the exponentiated intercept is essentially irrelevant here, since it refers to January 1, 1970
  • the exponentiated slope is about 1.002; since this is on the exponentiated scale, the model estimates roughly a 0.2% increase in web traffic per day
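The geometric-mean identity used above is easy to verify on a toy vector (the numbers below are made up):

```r
# exponentiating the arithmetic mean of the logs gives the geometric mean
y <- c(1, 2, 4, 8)
geo1 <- exp(mean(log(y)))
geo2 <- prod(y)^(1 / length(y))
c(geo1, geo2)   # both equal 2^1.5, about 2.828
```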

18.5 Example - Poisson Regression

  • the Poisson/log-linear model can be constructed as log of the mean \[\log\left(E[NH_i | D_i, \beta_0, \beta_1]\right) = \beta_0 + \beta_1 D_i\] or in other form \[E[NH_i | D_i, \beta_0, \beta_1] = \exp\left(\beta_0 + \beta_1 D_i\right)\]
    • \(NH_i\) = number of hits to the website
    • \(D_i\) = date
    • \(\beta_0\) = expected number of hits on day 0 (1970-01-01)
    • \(\beta_1\) = expected increase in number of hits per unit day
    • Note: the Poisson model differs from the log-outcome model in that its coefficients are interpreted on the scale of the expected value of the outcome, whereas the log model's coefficients are interpreted on the log scale of the outcome
  • we can transform the Poisson model to \[E[NH_i | D_i, \beta_0, \beta_1] = \exp\left(\beta_0 + \beta_1 D_i\right) = \exp\left(\beta_0 \right)\exp\left(\beta_1 D_i\right)\]
    • If \(D_i\) is increased by one unit, \(E[NH_i | D_i, \beta_0, \beta_1]\) is multiplied by \(\exp\left(\beta_1\right)\), since \[\frac{\exp\left(\beta_0 + \beta_1 (D_i + 1)\right)}{\exp\left(\beta_0 + \beta_1 D_i\right)} = \exp(\beta_1)\]
    • \(e^{\beta_1}\), the exponentiated slope coefficient, is interpreted as the relative increase or decrease in the mean per one-unit change in the regressor
      • on the exponentiated scale, we check whether coefficients are close to 1
      • on the log scale, we check whether coefficients are close to 0
    • in our example \(\exp(\beta_1)\) can therefore be interpreted as the expected relative increase/decrease in web traffic hits per one day increase
  • glm(outcome~predictor, family = "poisson") = performs Poisson regression
# plot visits vs dates
plot(gaData$date,gaData$visits,pch=19,col="darkgrey",xlab="Date",ylab="Visits")
# construct Poisson regression model
glm1 <- glm(gaData$visits ~ gaData$date,family="poisson")
# plot linear regression line in red
abline(lm1,col="red",lwd=3)
# plot Poisson regression line in blue (represents mean number of visits per day, which is lambda)
lines(gaData$date,glm1$fitted,col="blue",lwd=3)

  • Note: the Poisson fit is non-linear since it is linear only on the log of the mean scale
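The multiplicative interpretation of \(e^{\beta_1}\) can be illustrated on simulated data with a known growth rate (the 2% daily rate and all other numbers below are made up for illustration):

```r
# simulate daily counts whose mean grows ~2% per day, then recover the rate
set.seed(1)
day <- 1:200
lambda <- exp(0.5 + 0.02 * day)          # true relative increase per day: exp(0.02)
counts <- rpois(length(day), lambda)
fit <- glm(counts ~ day, family = "poisson")
exp(coef(fit))["day"]                    # estimated relative increase per day, near 1.02
```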

18.6 Example - Robust Standard Errors with Poisson Regression

  • variance of the Poisson distribution is defined to be the mean of the distribution, so we would expect the variance to increase with higher values of \(X\)
  • below are the residuals vs fitted value plot for the Poisson regression model
# plot residuals vs fitted values
plot(glm1$fitted, glm1$residuals, pch=19, col="grey", ylab="Residuals", xlab="Fitted")

  • as we can see from above, the residuals don’t appear to be increasing with higher fitted values (variance should increase with higher values of \(X\))
  • even if the mean model is correct in principle, there could always be a certain degree of model mis-specification
  • to account for mis-specifications for the model, we can use
    1. glm(outcome~predictor, family = "quasipoisson") = introduces an additional multiplicative dispersion factor \(\phi\) into the model so that the variance is \(\phi \mu\) rather than just \(\mu\) (see Variances and Quasi-Likelihoods)
    2. more generally, robust standard errors (effectively constructing wider confidence intervals) can be used
  • model-agnostic standard errors, implemented through the sandwich package, are one way to calculate robust standard errors
    • the algorithm assumes the mean relationship is specified correctly and attempts to get a general estimate of the variance that isn’t highly dependent on the model
    • it relies on large sample sizes and asymptotics to estimate confidence intervals that are robust to model mis-specification
    • Note: more information can be found at http://stackoverflow.com/questions/3817182/vcovhc-and-confidence-interval
# load sandwich package
library(sandwich)
# define a model-agnostic confidence interval function based on the sandwich estimator
confint.agnostic <- function (object, parm, level = 0.95, ...)
{
    cf <- coef(object); pnames <- names(cf)
    if (missing(parm))
        parm <- pnames
    else if (is.numeric(parm))
        parm <- pnames[parm]
    a <- (1 - level)/2; a <- c(a, 1 - a)
    pct <- stats:::format.perc(a, 3)
    fac <- qnorm(a)
    ci <- array(NA, dim = c(length(parm), 2L), dimnames = list(parm, pct))
    ses <- sqrt(diag(sandwich::vcovHC(object)))[parm]
    ci[] <- cf[parm] + ses %o% fac
    ci
}
# regular confidence interval from Poisson Model
confint(glm1)
##                     2.5 %        97.5 %
## (Intercept) -34.346577587 -31.159715656
## gaData$date   0.002190043   0.002396461
# model agnostic standard errors
confint.agnostic(glm1)
##                     2.5 %        97.5 %
## (Intercept) -36.362674594 -29.136997254
## gaData$date   0.002058147   0.002527955
  • as we can see from above, the robust standard error produced slightly wider confidence intervals
  • Note: above CIs are on log scale. Use exp(confint(glm1)) to get more interpretable values.
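To see what the quasi-Poisson dispersion factor does, here is a base-R sketch on simulated overdispersed counts (negative-binomial data, so the true variance exceeds the mean; all parameter values are made up):

```r
# simulate counts with variance larger than the mean
set.seed(7)
x <- seq(0, 2, length.out = 300)
mu <- exp(1 + x)
counts <- rnbinom(300, mu = mu, size = 2)     # overdispersed relative to Poisson
pois  <- glm(counts ~ x, family = "poisson")
quasi <- glm(counts ~ x, family = "quasipoisson")
# the point estimates are identical; quasi-Poisson only inflates the standard errors
summary(quasi)$dispersion                     # estimated phi, well above 1
sqrt(diag(vcov(quasi))) / sqrt(diag(vcov(pois)))   # SE inflation = sqrt(phi)
```

The coefficient estimates agree because both fits solve the same estimating equations; only the variance model, and hence the confidence intervals, changes.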

18.7 Example - Model Shortcoming

  • Our model looks like a pretty good description of the data, but no model is perfect and we can often learn about a data generation process by looking for a model’s shortcomings.
# look at the first 60 days
idx <- 1:60
par(mfrow = c(1, 2))
plot(visits ~ date, gaData[idx, ],
     main = '"Zero Inflation" 2011',
     xlab = "Date", ylab = "Visits", pch = 21, bg = 'darkgrey')
lines(gaData$date[idx], glm1$fitted.values[idx], lwd = 5, col = "red")
points(as.Date("2011-01-10"), 5, cex = 12, lwd = 5, col = "red")
text(as.Date("2011-01-05"), 9, "Zero Inflation", pos = 4)
plot(gaData$date, gaData$visits - glm1$fitted.values, pch = 21, bg = 'darkgrey',
     main = "Variance != Mean?", xlab = "Date", ylab = "Visits over Average")
lines(gaData$date, sqrt(glm1$fitted.values), lwd = 5, lty = 2, col = 'red')
lines(gaData$date, -sqrt(glm1$fitted.values), lwd = 5, lty = 2, col = 'red')

  • As shown in the figure, one shortcoming of our model is ‘zero inflation’ in the first two weeks of January 2011, before the site had any visits. The model systematically overestimates the number of visits during this time.
  • A less obvious issue is that the standard deviation of the data may be increasing with lambda faster than a Poisson model allows. This possibility can be seen in the rightmost plot by visually comparing the spread of grey dots with the standard deviation predicted by the model (red dashes). Also, there are four or five bursts of popularity during which the number of visits far exceeds two standard deviations over average. Perhaps these are due to mentions on another site.
  • It seems that at least some of them are. The simplystats column of our data records the number of visits to the Leek Group site which come from the related site, Simply Statistics (i.e., visits due to clicks on a link to the Leek Group which appeared in a Simply Statistics post).
par(mfrow = c(1, 1))
plot(visits ~ date, gaData, pch = 21, bg = 'darkgrey', main = "Bursts of Popularity",
     xlab = "Date", ylab = "Visits")
points(simplystats ~ date, gaData, pch = 19, col = 'red')
lines(simplystats ~ date, gaData, lwd = 1.5, col = 'darkgreen')
legend('topleft', c("Visits", "Visits from Simply Statistics"), pch = c(21, 19),
       pt.bg = c("darkgrey", "black"), col = c("black", "red"), bg = "white")

  • In the figure, the maximum number of visits occurred in late 2012. Visits from the Simply Statistics blog were also at their maximum that day. To find the exact date we can use which.max(gaData[,"visits"]).
# maximum number of visits at this date
idx <- which.max(gaData[,"visits"])
gaData[idx,]
##           date visits simplystats julian
## 704 2012-12-04     94          64  15678
  • The maximum number of visits, 94, occurred on December 4, 2012, of which 64 came from the Simply Statistics blog. We might consider the 64 visits to be a special event, over and above normal. Can the difference, 94-64=30 visits, be attributed to normal traffic as estimated by our model? To check, we will need the value of lambda (=expected number of visits) on December 4, 2012. The number of visits explained by our model on December 4, 2012 are those of a Poisson random variable with mean lambda. We can find the 95th percentile of this distribution using qpois(.95, lambda).
# expected number of visits (= lambda) at date with maximum visit
lambda <- glm1$fitted.values[idx]
# number of visits at the 95th percentile given by our model
qpois(.95, lambda)
## [1] 33
  • So, 95% of the time we would see 33 or fewer visits, hence 30 visits would not be rare according to our model. It would seem that on December 4, 2012, the very high number of visits was due to references from Simply Statistics. To gauge the importance of references from Simply Statistics we may wish to model the proportion of traffic such references represent. Doing so will also illustrate the use of glm’s parameter, offset, to model frequencies and proportions.

18.8 Example - Rates

  • A Poisson process generates counts, which are whole numbers: 0, 1, 2, 3, etc. A proportion (or rate) is a fraction. So how can a Poisson model describe a proportion? The trick is to include the denominator of the fraction, or more precisely its log, as an offset.
    • the idea applies whenever a count should be interpreted relative to some exposure term that determines how large or small the count should be
    • for example, counting failures of the nuclear pumps mentioned before, we should see more failures the longer we monitor them; counting flu cases, we should see more cases in a larger population, so more in a big city than in a small one
    • in all these cases, when the count should be interpreted relative to some term (a unit of time, person-time at risk, total sample size), we want to model not the expected value of the outcome itself but the expected value of the outcome divided by that term
  • in this case, we are looking at the number of web hits originating from Simply Statistic blog, relative to the total number of web hits \[\begin{aligned} E[NHSS_i | D_i, \beta_0, \beta_1]/NH_i & = \exp\left(\beta_0 + \beta_1 D_i\right) \\ (take~\log~of~both~sides)~\log\left(E[NHSS_i | D_i, \beta_0, \beta_1]\right) - \log(NH_i) & = \beta_0 + \beta_1 D_i \\ (move~\log(NH_i)~to~right~side)~\log\left(E[NHSS_i | D_i, \beta_0, \beta_1]\right) & = \log(NH_i) + \beta_0 + \beta_1 D_i \\ \end{aligned}\]
    • when offset term \(\log(NH_i)\) is present in the Poisson model, the interpretation of the coefficients will be relative to the offset quantity
    • it’s important to recognize that the fitted response doesn’t change
    • Example: to convert the outcome from daily data to hourly, we can add a factor of 24 so that the model becomes \[\begin{aligned} E[NHSS_i | D_i, \beta_0, \beta_1]/24 & = \exp\left(\beta_0 + \beta_1 D_i\right) \\ (take~\log~of~both~sides)~\log\left(E[NHSS_i | D_i, \beta_0, \beta_1]\right) - \log(24) & = \beta_0 + \beta_1 D_i \\ (move~\log(24)~to~right~side)~\log\left(E[NHSS_i | D_i, \beta_0, \beta_1]\right) & = \log(24) + \beta_0 + \beta_1 D_i \\ \end{aligned}\]
  • As shown, we would like to model the fraction simplystats/visits, but to avoid division by zero we’ll actually use simplystats/(visits+1).
  • We fit the Poisson model now with an offset so that the model is interpreted with respect to the number of visits
    • glm(outcome ~ predictor, offset = log(offset), family = "poisson") = perform Poisson regression with offset
    • glm(outcome ~ predictor + offset(log(offset)), family = "poisson") = produces the same result
    • Note: glm’s offset parameter fixes the coefficient of the offset term to 1, so that the model describes the fraction simplystats/visits
  • below is a plot comparing the glm1 fitted means (total web hits, in red) with the glm2 fitted means (web hits originating from Simply Statistics, adjusted for total traffic via the offset, in blue)
# perform Poisson regression with offset for number of visits
glm2 <- glm(simplystats ~ date, offset=log(visits+1), family="poisson", data=gaData)
# plot the fitted means (from simply statistics) in BLUE
plot(gaData$date, glm2$fitted,col="blue", pch=19,xlab="Date", ylab="Fitted Counts")
# plot the fitted means (total visit) in RED
points(gaData$date, glm1$fitted, col="red", pch=19)

# plot the rates for simply stats (actual data points)
plot(gaData$date, gaData$simplystats/(gaData$visits+1), col="grey", xlab="Date",
     ylab="Fitted Rates", pch=19)
# plot the fitted rates for simply stats (visit/day) (fitted model)
lines(gaData$date, glm2$fitted/(gaData$visits+1), col="blue", lwd=3)

  • Above plot shows the fitted model (blue line) relative to the data (gray points): there are a lot of zeros early on, and then it took off quite a bit
  • Although summary(glm2) will show that the estimated coefficients are significantly different than zero, the model is actually not impressive. We can illustrate why by looking at December 4, 2012, once again. On that day there were 64 actual visits from Simply Statistics. However, according to glm2, 64 visits would be extremely unlikely. You can verify this weakness in the model by finding glm2’s 95th percentile for that day.
# 95th percentile of visits predicted by glm2 for that day (the actual count was 64)
qpois(.95, glm2$fitted.values[idx])
## [1] 47
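The equivalence of the offset parameter and an offset() term in the formula can be checked on simulated data (the exposure values and coefficients below are made up):

```r
# simulate counts proportional to an exposure term with known coefficients
set.seed(3)
n <- 200
exposure <- sample(10:100, n, replace = TRUE)
x <- rnorm(n)
counts <- rpois(n, lambda = exposure * exp(0.2 + 0.3 * x))
# offset supplied via glm's offset parameter
fit1 <- glm(counts ~ x, offset = log(exposure), family = "poisson")
# offset supplied via an offset() term in the formula
fit2 <- glm(counts ~ x + offset(log(exposure)), family = "poisson")
coef(fit1)   # close to the true values (0.2, 0.3)
coef(fit2)   # identical to fit1
```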

18.9 Further Resources

19 Fit Functions Using Linear Models

19.1 Considerations

  • basis = the collection of regressors
  • single knot point terms can fit hockey-stick-like processes
  • these bases can be used in GLMs (as an additional term/predictor) as well
  • the issue with these approaches is the large number of parameters introduced
    • this requires some method of regularization, or a penalty for a large number of parameters (see Practical Machine Learning course)
    • introducing a large number of knots can lead to overfitting if the fit is not regularized
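As a minimal sketch of the single-knot idea, here is a fit to simulated "hockey stick" data with the knot location assumed known at x = 5 (all numbers are made up):

```r
# simulate a process that is flat before x = 5 and rises afterwards
set.seed(10)
x <- seq(0, 10, length.out = 200)
y <- 2 + 1.5 * (x > 5) * (x - 5) + rnorm(200, sd = 0.3)
# the (x - 5)+ term lets the slope change at the knot
fit <- lm(y ~ x + I((x > 5) * (x - 5)))
round(coef(fit), 2)   # roughly: intercept 2, pre-knot slope 0, slope change 1.5
```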

19.2 Example - Fitting Piecewise Linear Function

# simulate data: sin curve + noise
n <- 500; x <- seq(0, 4 * pi, length = n); y <- sin(x) + rnorm(n, sd = .3)
# define 20 interior knot points, equally spaced across the range of x
knots <- seq(0, 4 * pi, length = 22)[-c(1, 22)]
# define the ()+ function to only take the values that are positive after the knot pt
splineTerms <- sapply(knots, function(knot) (x > knot) * (x - knot))
# define the predictors as X and spline term
xMat <- cbind(x, splineTerms)
# fit linear models for y vs predictors
yhat <- predict(lm(y ~ xMat))
# plot data points (x, y)
plot(x, y, frame = FALSE, pch = 21, bg = "lightblue")
# plot fitted values
lines(x, yhat, col = "red", lwd = 2)

19.3 Example - Fitting Piecewise Quadratic Function

  • adding squared terms makes it continuous AND differentiable at the knot points, and the model becomes \[Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \sum_{k=1}^d (X_i - \xi_k)_+^2 \gamma_k + \epsilon_{i}\] where \((a)^2_+ = a^2\) if \(a > 0\) and \(0\) otherwise
  • adding cubic terms makes it twice continuously differentiable at the knot points, etcetera
# define the knot terms in the model
splineTerms <- sapply(knots, function(knot) (x > knot) * (x - knot)^2)
# define the predictors as x, x^2 and knot terms
xMat <- cbind(x, x^2, splineTerms)
# fit linear models for y vs predictors
yhat <- predict(lm(y ~ xMat))
# plot data points (x, y)
plot(x, y, frame = FALSE, pch = 21, bg = "lightblue")
# plot fitted values
lines(x, yhat, col = "red", lwd = 2)

19.4 Example - Harmonics using Linear Models

  • using Fourier bases
  • a discrete Fourier transform is an instance of the linear regression model, using sine and cosine functions as the basis to fit data
  • to demonstrate this, we will generate 2 seconds of sound data using sine waves, simulate a chord, and apply linear regression to find out which notes are playing
# frequencies for white keys from c4 to c5
notes4 <- c(261.63, 293.66, 329.63, 349.23, 392.00, 440.00, 493.88, 523.25)
# generate sequence for 2 seconds
t <- seq(0, 2, by = .001); n <- length(t)
# define data for c4 e4 g4 using sine waves with their frequencies
c4 <- sin(2 * pi * notes4[1] * t); e4 <- sin(2 * pi * notes4[3] * t);
g4 <- sin(2 * pi * notes4[5] * t)
# define data for a chord and add a bit of noise
chord <- c4 + e4 + g4 + rnorm(n, 0, 0.3)
# generate profile data for all notes
x <- sapply(notes4, function(freq) sin(2 * pi * freq * t))
# fit the chord using the profiles for all notes
fit <- lm(chord ~ x - 1)
  • after generating the data and running the linear regression, we can plot the results to see if the notes are correctly identified
# set up plot
plot(c(0, 9), c(0, 1.5), xlab = "Note", ylab = "Coef^2", axes = FALSE, frame = TRUE, type = "n")
# set up axes
axis(2)
axis(1, at = 1 : 8, labels = c("c4", "d4", "e4", "f4", "g4", "a4", "b4", "c5"))
# add vertical lines for each note
for (i in 1 : 8) abline(v = i, lwd = 3, col = grey(.8))
# plot the linear regression fits
lines(c(0, 1 : 8, 9), c(0, coef(fit)^2, 0), type = "l", lwd = 3, col = "red")

  • as we can see from above, the correct notes were identified (using only sine functions)
  • we can also use the Fast Fourier Transforms to identify the notes (using sine and cosine functions)
    • fft(data) = performs fast Fourier transforms on provided data
    • Re(data) = subset to only the real components of the complex data
# perform a fast Fourier transform on the chord data
a <- fft(chord)
# plot only the real components/frequencies of the fft
plot(Re(a)^2, type = "l")

  • Note: the FFT evaluates every frequency the sampling can resolve, not just the eight note profiles, which is why the peaks are so high in magnitude
  • Note: the mirrored display of the peaks is due to the conjugate symmetry of the Fourier transform of real-valued signals
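To connect the peak locations back to physical frequencies, note that DFT bin k (0-indexed) corresponds to k / (N * dt) Hz; below is a sketch that recovers the strongest note from a noiseless reconstruction of the chord (rebuilt here so the block is self-contained):

```r
# reconstruct the pure c4-e4-g4 chord (no noise, so the result is deterministic)
t <- seq(0, 2, by = 0.001)
chord <- sin(2 * pi * 261.63 * t) + sin(2 * pi * 329.63 * t) + sin(2 * pi * 392.00 * t)
a <- fft(chord)
# keep only the first half of the spectrum (the second half mirrors it for real signals)
half <- 1:floor(length(t) / 2)
power <- Mod(a[half])^2
# convert the strongest bin (0-indexed) to a frequency in Hz
peak_bin <- which.max(power) - 1
peak_freq <- peak_bin / (length(t) * 0.001)
peak_freq   # within 1 Hz of one of the chord frequencies (c4, e4, or g4)
```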